# Basic Setup Before Analysis
> Basic setup and frequently needed code before analysis (loading, exporting, random forest, catboost, cross validation ..)

- toc: true 
- badges: true
- comments: true
- categories: [Python]
- image: images/

---

# Files

## Loading a file

``` python
import pandas as pd
data = pd.read_csv('G:/내 드라이브/bb/cc/data.csv')
```

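## Exporting a file

``` python
test.to_csv('test.csv', index=False)
```

- With a relative path, the file is written to the current working directory (usually the folder that contains the source code).

```python
test.to_csv('G:/내 드라이브/Github/TIL-Blog/test.csv', index=False)
```

- With an absolute path, the file is written to that location.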
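
To see what `index=False` changes, one option is to write a small DataFrame both ways and read it back. This is only an illustrative sketch: the `scores` DataFrame and the temporary directory are made-up stand-ins, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data (not from the original post)
scores = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [90, 85, 70]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    scores.to_csv(with_index)                  # the index is written as an extra, unnamed column
    scores.to_csv(without_index, index=False)  # only the data columns are written

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'name', 'score']
print(cols_without_index)  # ['name', 'score']
```

Without `index=False`, the saved index comes back as an `Unnamed: 0` column on the next `read_csv`.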

---

# Basic libraries

``` python
import pandas as pd  # pandas
import numpy as np  # numpy

import matplotlib.pyplot as plt  # matplotlib
import matplotlib

import seaborn as sns  # seaborn
```

---

# Preprocessing

## train / validation set split

``` python
train = pd.read_csv('https://bit.ly/fc-ml-titanic')

feature = [
    'Pclass', 'Sex', 'Age', 'Fare'
]

label = [
    'Survived'
]

from sklearn.model_selection import train_test_split
```

* **test_size**: fraction of the data assigned to the validation set (20% -> 0.2)
* **shuffle**: whether to shuffle the rows before splitting (default True)
* **random_state**: random seed, so the split is reproducible

``` python
x_train, x_valid, y_train, y_valid = train_test_split(
    train[feature], train[label], test_size=0.2, shuffle=True, random_state=30
)
```
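
For classification, `train_test_split` also accepts a `stratify` argument that keeps the label ratio equal in both splits. A minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical, chosen only so the example runs offline):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% positive labels
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(1, 80, size=100),
    'Fare': rng.uniform(5, 100, size=100),
    'Survived': [1] * 30 + [0] * 70,
})

x_tr, x_va, y_tr, y_va = train_test_split(
    toy[['Age', 'Fare']], toy['Survived'],
    test_size=0.2, shuffle=True, random_state=30,
    stratify=toy['Survived'],  # keep the label ratio the same in both splits
)

print(x_tr.shape, x_va.shape)    # (80, 2) (20, 2)
print(y_tr.mean(), y_va.mean())  # 0.3 0.3
```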

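## Handling missing values

``` python
from sklearn.impute import SimpleImputer
```

### 1. Numeric

Handling a single column

``` python
# fillna returns a new Series, so assign the result back
train['Age'] = train['Age'].fillna(train['Age'].mean())
```

Handling multiple columns

``` python
# Impute several columns at once; strategy can be 'median', 'mean', ...
imputer = SimpleImputer(strategy='median')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result
```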
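
When a validation set is held out, a common pattern is to fit the imputer on the training rows only and reuse the learned statistic on the validation rows. A minimal sketch, assuming two made-up frames `toy_train` and `toy_valid` (hypothetical names, not from the original):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing ages
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0, 30.0]})
toy_valid = pd.DataFrame({'Age': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='median')
imputer.fit(toy_train[['Age']])  # the median (30.0) is learned from the training rows only

toy_train[['Age']] = imputer.transform(toy_train[['Age']])
toy_valid[['Age']] = imputer.transform(toy_valid[['Age']])

print(imputer.statistics_)        # [30.]
print(toy_valid['Age'].tolist())  # [30.0, 50.0]
```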

### 2. Categorical

``` python
# Reload the raw data
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
```

Handling a single column

``` python
# fillna returns a new Series, so assign the result back
train['Embarked'] = train['Embarked'].fillna('S')
```

Handling multiple columns

``` python
imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result
```

## Label Encoding: converting strings to numbers

``` python
from sklearn.preprocessing import LabelEncoder

train['Embarked_num'] = LabelEncoder().fit_transform(train['Embarked'])

train['Embarked_num'].value_counts()
```

```
2    646
0    168
1     77
Name: Embarked_num, dtype: int64
```
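
The integer codes follow the sorted order of the class labels. Keeping the fitted encoder around makes the mapping visible through `classes_` and reversible through `inverse_transform`. A small sketch on a made-up `ports` Series (hypothetical sample, not the Titanic data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of port codes
ports = pd.Series(['S', 'C', 'Q', 'S', 'C'])

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ is sorted, so the position of each class is its integer code
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)                                    # {'C': 0, 'Q': 1, 'S': 2}
print(codes.tolist())                             # [2, 0, 1, 2, 0]
print(encoder.inverse_transform(codes).tolist())  # ['S', 'C', 'Q', 'S', 'C']
```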

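## One-hot encoding

``` python
pd.get_dummies(train['Embarked_num'], prefix='Embarked')
```

| | Embarked_0 | Embarked_1 | Embarked_2 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 886 | 0 | 0 | 1 |
| 887 | 0 | 0 | 1 |
| 888 | 0 | 0 | 1 |
| 889 | 1 | 0 | 0 |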
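
`pd.get_dummies` also works directly on a string column, in which case the new column names carry the category labels instead of integer codes. A minimal sketch on made-up data (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical sample of port codes
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# dtype=int gives 0/1 columns regardless of the pandas version
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

print(dummies.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(dummies.values.tolist())   # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```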


## Normalization

``` python
movie = {'naver': [2, 4, 6, 8, 10],
         'netflix': [1, 2, 3, 4, 5]}
movie = pd.DataFrame(data=movie)

from sklearn.preprocessing import MinMaxScaler

min_max_movie = MinMaxScaler().fit_transform(movie)

pd.DataFrame(min_max_movie, columns=['naver', 'netflix'])
```

| | naver | netflix |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.25 | 0.25 |
| 2 | 0.5 | 0.5 |
| 3 | 0.75 | 0.75 |
| 4 | 1.0 | 1.0 |
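
Min-max scaling maps each column to the range 0 to 1 with `(x - min) / (max - min)`. The sketch below recomputes it by hand on the same `movie` values and compares the result with `MinMaxScaler`; the name `manual` is introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

movie = pd.DataFrame({'naver': [2, 4, 6, 8, 10],
                      'netflix': [1, 2, 3, 4, 5]})

# Column-wise (x - min) / (max - min)
manual = (movie - movie.min()) / (movie.max() - movie.min())
scaled = MinMaxScaler().fit_transform(movie)

print(np.allclose(manual.values, scaled))  # True
print(manual['naver'].tolist())            # [0.0, 0.25, 0.5, 0.75, 1.0]
```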


## Standardization (Standard Scaling)

``` python
from sklearn.preprocessing import StandardScaler

x = np.arange(10)
# add an outlier
x[9] = 1000

scaled = StandardScaler().fit_transform(x.reshape(-1, 1))

x
```

```
array([   0,    1,    2,    3,    4,    5,    6,    7,    8, 1000])
```
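
Standardization subtracts the mean and divides by the standard deviation. A minimal sketch that recomputes it by hand on the same array and compares it with `StandardScaler` (the name `manual` is introduced only for this illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.arange(10)
x[9] = 1000  # same outlier as above

scaled = StandardScaler().fit_transform(x.reshape(-1, 1))

# StandardScaler uses the population standard deviation (ddof=0), like np.std
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled.ravel(), manual))  # True
print(scaled.ravel().round(2))
```

Because the single outlier dominates both the mean and the standard deviation, the nine regular values are squeezed into a narrow band around -0.3 while the outlier lands near 3.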

---

# Model

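## CatBoost + example

``` python
from catboost import CatBoostRegressor  # CatBoost regression
from catboost import CatBoostClassifier  # CatBoost classification
```

``` python
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_test, y_test are assumed to be prepared beforehand
model = CatBoostRegressor()
model.fit(X_train, y_train, silent=True)

pred = model.predict(X_test)

# mean_squared_error already returns the mean, so only the square root is needed
rmse = np.sqrt(mean_squared_error(y_test, pred))
rmse
```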

## Random Forest

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
```

```python
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123, max_depth=6)

rf.fit(X_train, y_train)
```
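
With `oob_score=True`, the fitted forest exposes `oob_score_`, an accuracy estimate computed on the samples each tree did not see during bootstrapping. A minimal sketch, assuming synthetic data from `make_classification` in place of the real `X_train` / `y_train` (`X_demo` and `y_demo` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=123
)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123, max_depth=6)
rf.fit(X_demo, y_demo)

print(rf.oob_score_)            # accuracy estimated on out-of-bag samples
print(rf.feature_importances_)  # one value per feature, summing to 1
```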

## XGBoost, LightGBM

``` python
from xgboost import XGBRegressor
from xgboost import XGBClassifier

from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
```

---

# Evaluation metrics

## RMSE

``` python
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, pred))
rmse
```
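
RMSE is the square root of the mean of the squared errors. The sketch below checks the scikit-learn result against a direct NumPy computation; `y_true` and `y_pred` are made-up values used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt((1 + 0 + 4 + 1) / 4)

print(rmse_sklearn, rmse_manual)  # both about 1.2247
```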

## Accuracy

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predicted)
```
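
Accuracy is the fraction of predictions that match the true labels. A minimal sketch comparing `accuracy_score` with a direct NumPy computation, using made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

acc_sklearn = accuracy_score(y_true, y_pred)
acc_manual = np.mean(y_true == y_pred)  # 3 of 5 predictions match

print(acc_sklearn, acc_manual)  # 0.6 0.6
```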