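## Five problems to solve
- Convert timestamp to datetime and set it as the index
- Remove duplicate rows (09:05 appears twice)
- Convert vibration to numeric ('에러' → NaN)
- Fill missing vibration values with linear interpolation
- Mark vibration as -999 in rows where status is 'error'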
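
For orientation, the five steps listed above could be chained roughly as follows. This is only a sketch on the notebook's sample data, and the name `cleaned` is a hypothetical one that the notebook itself does not use. It assumes that keeping the first of the duplicated 09:05 rows is acceptable, and it leaves timestamp as the index at the end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2024-01-01 09:00', '2024-01-01 09:05', '2024-01-01 09:05',
                  '2024-01-01 09:10', '2024-01-01 09:15', '2024-01-01 09:20'],
    'machine_id': ['M001', 'M001', 'M001', 'M002', 'M001', 'M002'],
    'vibration': [2.5, np.nan, 3.0, 4.5, np.nan, '에러'],
    'status': ['ok', 'OK', 'ok', 'warn', 'ok', 'error'],
})

cleaned = df.copy()  # work on a copy so the raw data stays untouched

# 1. Parse the timestamp strings
cleaned['timestamp'] = pd.to_datetime(cleaned['timestamp'])

# 2. Drop repeated timestamps while timestamp is still a column, then index by it
cleaned = cleaned.drop_duplicates(subset=['timestamp'], keep='first')
cleaned = cleaned.set_index('timestamp')

# 3. Non-numeric readings such as '에러' become NaN
cleaned['vibration'] = pd.to_numeric(cleaned['vibration'], errors='coerce')

# 4. Fill the gaps by linear interpolation
cleaned['vibration'] = cleaned['vibration'].interpolate(method='linear')

# 5. Flag error rows with the sentinel value
cleaned.loc[cleaned['status'] == 'error', 'vibration'] = -999

print(cleaned)
```

The sections below work through each step separately and record the mistakes made along the way.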

In [1]:
# Start with this data
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'timestamp': ['2024-01-01 09:00', '2024-01-01 09:05', '2024-01-01 09:05', 
                  '2024-01-01 09:10', '2024-01-01 09:15', '2024-01-01 09:20'],
    'machine_id': ['M001', 'M001', 'M001', 'M002', 'M001', 'M002'],
    'vibration': [2.5, np.nan, 3.0, 4.5, np.nan, '에러'],
    'status': ['ok', 'OK', 'ok', 'warn', 'ok', 'error']
})

print(df)

          timestamp machine_id vibration status
0  2024-01-01 09:00       M001       2.5     ok
1  2024-01-01 09:05       M001       NaN     OK
2  2024-01-01 09:05       M001       3.0     ok
3  2024-01-01 09:10       M002       4.5   warn
4  2024-01-01 09:15       M001       NaN     ok
5  2024-01-01 09:20       M002        에러  error


### Checking the basic structure

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   timestamp   6 non-null      object
 1   machine_id  6 non-null      object
 2   vibration   4 non-null      object
 3   status      6 non-null      object
dtypes: object(4)
memory usage: 324.0+ bytes


In [3]:
df.describe()

               timestamp machine_id  vibration status
count                  6          6        4.0      6
unique                 5          2        4.0      4
top     2024-01-01 09:05       M001        2.5     ok
freq                   2          4        1.0      3
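
The summaries above hint at the problems (5 unique timestamps out of 6, an object-typed vibration column, 4 distinct status labels). A minimal sketch of how those hints could be checked directly is shown here; the variable names `dup_count`, `as_number`, `non_numeric` and `status_counts` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2024-01-01 09:00', '2024-01-01 09:05', '2024-01-01 09:05',
                  '2024-01-01 09:10', '2024-01-01 09:15', '2024-01-01 09:20'],
    'vibration': [2.5, np.nan, 3.0, 4.5, np.nan, '에러'],
    'status': ['ok', 'OK', 'ok', 'warn', 'ok', 'error'],
})

# Number of rows whose timestamp repeats an earlier row
dup_count = int(df['timestamp'].duplicated().sum())

# Values that are present but cannot be parsed as numbers
as_number = pd.to_numeric(df['vibration'], errors='coerce')
non_numeric = df.loc[df['vibration'].notna() & as_number.isna(), 'vibration']

# Frequency of each status label, case variants counted separately
status_counts = df['status'].value_counts()

print(dup_count)
print(non_numeric.tolist())
print(status_counts.to_dict())
```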


### Convert timestamp to datetime and set it as the index

In [4]:
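# Convert with pd.to_datetime
# df_1st = pd.to_datetime(df['timestamp']) -> wrong: this stores only the converted Series in a new variable. The goal is to convert the column, so assign the result back to df['timestamp'].
df['timestamp'] = pd.to_datetime(df['timestamp']) # pd.to_* functions perform an explicit conversion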

# Set timestamp as the index
df_1st = df.set_index('timestamp')

print(df_1st)
print(df) # df itself was not re-indexed, so it still shows the default 0, 1, 2, ... index.

                    machine_id vibration status
timestamp                                      
2024-01-01 09:00:00       M001       2.5     ok
2024-01-01 09:05:00       M001       NaN     OK
2024-01-01 09:05:00       M001       3.0     ok
2024-01-01 09:10:00       M002       4.5   warn
2024-01-01 09:15:00       M001       NaN     ok
2024-01-01 09:20:00       M002        에러  error
            timestamp machine_id vibration status
0 2024-01-01 09:00:00       M001       2.5     ok
1 2024-01-01 09:05:00       M001       NaN     OK
2 2024-01-01 09:05:00       M001       3.0     ok
3 2024-01-01 09:10:00       M002       4.5   warn
4 2024-01-01 09:15:00       M001       NaN     ok
5 2024-01-01 09:20:00       M002        에러  error
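
The two printouts above differ because `set_index` returns a new DataFrame and leaves the original untouched. A small sketch of that behaviour, using a hypothetical two-row frame called `sample`:

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
sample = pd.DataFrame({
    'timestamp': ['2024-01-01 09:00', '2024-01-01 09:05'],
    'vibration': [2.5, 3.0],
})
sample['timestamp'] = pd.to_datetime(sample['timestamp'])

indexed = sample.set_index('timestamp')  # returns a new DataFrame

print(type(sample.index).__name__)     # RangeIndex
print(type(indexed.index).__name__)    # DatetimeIndex
print('timestamp' in sample.columns)   # True: still a column in the original
print('timestamp' in indexed.columns)  # False: it moved into the index
```

Because timestamp stops being a column after `set_index`, later calls that refer to it by column name need either the original frame or a `reset_index` first.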


### Remove duplicate rows (09:05 appears twice)

In [5]:
df_1st

                    machine_id vibration status
timestamp                                      
2024-01-01 09:00:00       M001       2.5     ok
2024-01-01 09:05:00       M001       NaN     OK
2024-01-01 09:05:00       M001       3.0     ok
2024-01-01 09:10:00       M002       4.5   warn
2024-01-01 09:15:00       M001       NaN     ok
2024-01-01 09:20:00       M002        에러  error


In [6]:
df_1st['machine_id'].dtype # 'O' stands for object.

dtype('O')

In [7]:
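# df_2nd = df_1st.duplicated(subset=['timestamp','machine_id'],keep='first') -> error: timestamp is the index, not a column, so it cannot be named in subset. Reset the index first.
# (duplicated() also returns a boolean mask rather than the deduplicated frame; drop_duplicates() returns the frame.)

df_1st = df_1st.reset_index('timestamp') # Run this only once: a second call fails because timestamp is no longer an index level.
df_2nd = df_1st.drop_duplicates(subset=['timestamp'],keep='first')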

In [8]:
df_2nd

            timestamp machine_id vibration status
0 2024-01-01 09:00:00       M001       2.5     ok
1 2024-01-01 09:05:00       M001       NaN     OK
3 2024-01-01 09:10:00       M002       4.5   warn
4 2024-01-01 09:15:00       M001       NaN     ok
5 2024-01-01 09:20:00       M002        에러  error
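
Resetting the index is one way to deduplicate on timestamp. As an alternative sketch, duplicates can also be detected on the index itself with `Index.duplicated`, which keeps timestamp as the index. The frame `sample` below is a hypothetical three-row excerpt of the data.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2024-01-01 09:00', '2024-01-01 09:05', '2024-01-01 09:05'])
sample = pd.DataFrame({'vibration': [2.5, np.nan, 3.0]}, index=idx)
sample.index.name = 'timestamp'

# True for every index label that repeats an earlier one
mask = sample.index.duplicated(keep='first')

# Invert the mask to keep only the first occurrence of each timestamp
deduped = sample[~mask]

print(mask.tolist())
print(deduped)
```

Note that `keep='first'` retains whichever row comes first. In this data that is the 09:05 row with the missing vibration value, and the row with the 3.0 reading is the one dropped.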


### Convert vibration to numeric ('에러' → NaN)


In [9]:
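# df_2nd['vibration'] = df_2nd['vibration'].astype(float) => astype fails if even one value cannot be converted, so use the coercing conversion pd.to_numeric instead.
# pd.to_datetime, used above, belongs to the same family of conversion functions.

df_2nd['vibration']= pd.to_numeric(df_2nd['vibration'],errors = 'coerce') 
# errors='coerce' = values that cannot be converted become NaN.
# errors='ignore' = return the input unchanged when conversion fails (deprecated in recent pandas versions).
# default (errors='raise') = conversion fails with an exception.
# astype is stricter and has no coerce option.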

df_2nd

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_2nd['vibration']= pd.to_numeric(df_2nd['vibration'],errors = 'coerce')


            timestamp machine_id  vibration status
0 2024-01-01 09:00:00       M001        2.5     ok
1 2024-01-01 09:05:00       M001        NaN     OK
3 2024-01-01 09:10:00       M002        4.5   warn
4 2024-01-01 09:15:00       M001        NaN     ok
5 2024-01-01 09:20:00       M002        NaN  error
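
The cell above prints a "set on a copy of a slice" warning because `df_2nd` came out of `drop_duplicates`, which pandas treats as derived from another frame. One common way to avoid this, sketched here on a hypothetical frame `raw`, is to take an explicit `.copy()` before assigning. The same sketch also contrasts `astype(float)` with `pd.to_numeric(..., errors='coerce')`.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'timestamp': ['09:00', '09:05', '09:05', '09:20'],
    'vibration': [2.5, np.nan, 3.0, '에러'],
})

# An explicit copy makes the result an independent frame,
# so the column assignment below does not trigger the warning
deduped = raw.drop_duplicates(subset=['timestamp'], keep='first').copy()

# astype(float) stops at the first value it cannot convert
try:
    deduped['vibration'].astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' converts what it can and uses NaN for the rest
deduped['vibration'] = pd.to_numeric(deduped['vibration'], errors='coerce')

print(astype_failed)
print(deduped['vibration'].tolist())
print(deduped['vibration'].dtype)
```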


### Fill missing vibration values with linear interpolation


In [10]:
df_2nd['vibration'] = df_2nd['vibration'].interpolate(method='linear')
df_2nd

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_2nd['vibration'] = df_2nd['vibration'].interpolate(method='linear')


            timestamp machine_id  vibration status
0 2024-01-01 09:00:00       M001        2.5     ok
1 2024-01-01 09:05:00       M001        3.5     OK
3 2024-01-01 09:10:00       M002        4.5   warn
4 2024-01-01 09:15:00       M001        4.5     ok
5 2024-01-01 09:20:00       M002        4.5  error
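
In the result above, the two trailing missing values were both filled with 4.5. With `method='linear'`, gaps between two known values are interpolated, while gaps after the last known value are padded with that value. The sketch below reproduces this on a plain Series `s` (an illustrative name) and shows `limit_area='inside'` as one option if trailing gaps should stay missing.

```python
import numpy as np
import pandas as pd

# Same pattern of gaps as the vibration column after deduplication
s = pd.Series([2.5, np.nan, 4.5, np.nan, np.nan])

# Default: interior gap interpolated, trailing gaps padded with 4.5
filled = s.interpolate(method='linear')

# limit_area='inside' fills only gaps surrounded by valid values
inside_only = s.interpolate(method='linear', limit_area='inside')

print(filled.tolist())
print(inside_only.tolist())
```

Note also that `method='linear'` treats rows as equally spaced and ignores the index values. Here the readings are five minutes apart, so that assumption holds for this sample.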


### Mark vibration as -999 in rows where status is 'error'

In [11]:
df_2nd

            timestamp machine_id  vibration status
0 2024-01-01 09:00:00       M001        2.5     ok
1 2024-01-01 09:05:00       M001        3.5     OK
3 2024-01-01 09:10:00       M002        4.5   warn
4 2024-01-01 09:15:00       M001        4.5     ok
5 2024-01-01 09:20:00       M002        4.5  error


In [12]:
df_2nd['status'].dtype

dtype('O')

In [13]:
# df_2nd[df_2nd['status'] == 'error'] = -999 -> wrong: this overwrites every column of the matching rows, not just vibration. Use .loc to change only that column.
df_2nd.loc[df_2nd['status'] == 'error','vibration'] = -999 # .loc[:, 'vibration'] is not what we want here: it would set the whole vibration column to -999.

df_2nd

            timestamp machine_id  vibration status
0 2024-01-01 09:00:00       M001        2.5     ok
1 2024-01-01 09:05:00       M001        3.5     OK
3 2024-01-01 09:10:00       M002        4.5   warn
4 2024-01-01 09:15:00       M001        4.5     ok
5 2024-01-01 09:20:00       M002     -999.0  error
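
The difference between selecting rows with a boolean mask and selecting all rows with `:` can be seen in a small sketch. The frame `base` and the names `masked` and `all_rows` are hypothetical and exist only for this comparison.

```python
import pandas as pd

base = pd.DataFrame({
    'vibration': [4.5, 4.5, 4.5],
    'status': ['ok', 'warn', 'error'],
})
mask = base['status'] == 'error'  # True only for the last row

# Row selector = mask, column selector = 'vibration'
masked = base.copy()
masked.loc[mask, 'vibration'] = -999

# Row selector = ':' selects every row, so the whole column changes
all_rows = base.copy()
all_rows.loc[:, 'vibration'] = -999

print(masked['vibration'].tolist())
print(all_rows['vibration'].tolist())
```

The comparison `== 'error'` is case-sensitive. If labels such as 'ERROR' could appear in real data, normalising with `.str.lower()` before comparing is one option, though the sample data here does not need it.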
