# Cohort 19 KNN Regular Session Assignment

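## Implementing KNN
### 1. Preprocessing / EDA
Based on what you have learned so far, preprocess the data and carry out EDA as you see fit.
### 2. KNN Implementation & Parameter Tuning
Referring to the lecture content and lab materials, implement KNN and compare the results as you tune its parameters.
### 3. Evaluation
Evaluate the results and add your own interpretation.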

**Data:** [blackfriday | Kaggle](https://www.kaggle.com/llopesolivei/blackfriday)

---

## 0. Loading the Data

In [1]:
import pandas as pd

# The first CSV column holds the row index, so use it as the DataFrame index.
df = pd.read_csv("blackfriday.csv", index_col=0)
df.head()

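|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001088 | P00046042 | F | 0-17 | 10 | A | 3 | 0 | 5 | 17.0 | NaN | 2010 |
| 1 | 1004493 | P00347742 | F | 0-17 | 10 | A | 1 | 0 | 7 | NaN | NaN | 4483 |
| 2 | 1005302 | P00048942 | F | 0-17 | 10 | A | 1 | 0 | 1 | 4.0 | NaN | 7696 |
| 3 | 1001348 | P00145242 | F | 0-17 | 10 | A | 3 | 0 | 2 | 4.0 | NaN | 16429 |
| 4 | 1001348 | P00106742 | F | 0-17 | 10 | A | 3 | 0 | 3 | 5.0 | NaN | 5780 |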


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4998 entries, 0 to 4997
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     4998 non-null   int64  
 1   Product_ID                  4998 non-null   object 
 2   Gender                      4998 non-null   object 
 3   Age                         4998 non-null   object 
 4   Occupation                  4998 non-null   int64  
 5   City_Category               4998 non-null   object 
 6   Stay_In_Current_City_Years  4998 non-null   object 
 7   Marital_Status              4998 non-null   int64  
 8   Product_Category_1          4998 non-null   int64  
 9   Product_Category_2          3465 non-null   float64
 10  Product_Category_3          1544 non-null   float64
 11  Purchase                    4998 non-null   int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 507.6+ KB
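
The `info()` output above shows that `Product_Category_2` and `Product_Category_3` have only 3465 and 1544 non-null values out of 4998 rows, so roughly 31% and 69% of their entries are missing. A minimal sketch of how that count and share could be tabulated per column is given below. The `toy` frame and its values are made up for illustration; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the three product-category columns (values are made up).
toy = pd.DataFrame({
    "Product_Category_1": [5, 7, 1, 2],
    "Product_Category_2": [17.0, np.nan, 4.0, 4.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 8.0],
})

# Count and share of missing values per column.
missing = toy.isna().sum()
missing_ratio = toy.isna().mean()
print(pd.DataFrame({"missing": missing, "ratio": missing_ratio}))
```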


### 1. Preprocessing

In [3]:
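# Drop identifier columns; they are IDs rather than features.
df = df.drop(columns=["User_ID", "Product_ID"])

# Label-encode Age. unique() returns values in order of appearance,
# so the integer codes do not necessarily follow age order.
age_le = {y: x for x, y in enumerate(df["Age"].unique())}
df["Age"] = df["Age"].map(age_le)

# One-hot encode the nominal Occupation column.
df = pd.concat([df, pd.get_dummies(df["Occupation"], prefix="Occupation")], axis=1)
df = df.drop(columns=["Occupation"])

# One-hot encode the nominal City_Category column.
df = pd.concat([df, pd.get_dummies(df["City_Category"], prefix="City_Category")], axis=1)
df = df.drop(columns=["City_Category"])

# Encode Gender as binary: "M" -> 1, anything else -> 0.
df["Gender"] = df["Gender"].map(lambda x: 1 if x == "M" else 0)

# Strip the "+" suffix (e.g. "4+") and convert the years to int.
df["Stay_In_Current_City_Years"] = df["Stay_In_Current_City_Years"].map(lambda x: int(x.replace("+", "")))

# Mark missing product categories with the sentinel value -1.
category_cols = ["Product_Category_1", "Product_Category_2", "Product_Category_3"]
df[category_cols] = df[category_cols].fillna(-1)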
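
Because KNN relies on distances, the integer codes assigned to `Age` affect which rows count as neighbors. The cell above derives the codes from `unique()`, which follows order of appearance. One possible alternative, sketched below, is an explicit mapping that preserves the natural order of the brackets. The bracket list `AGE_ORDER` is an assumption about the labels in this dataset and should be checked against `df["Age"].unique()` before use; the sample rows are made up.

```python
import pandas as pd

# Assumed age brackets of the dataset, listed from youngest to oldest.
AGE_ORDER = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
age_to_code = {label: code for code, label in enumerate(AGE_ORDER)}

# Toy column in arbitrary order (made-up rows).
ages = pd.Series(["26-35", "0-17", "55+", "18-25"])
encoded = ages.map(age_to_code)
print(encoded.tolist())
```

With an ordered mapping, the distance between two codes reflects how far apart the brackets are, which is not guaranteed when codes follow order of appearance.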

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
y = df["Purchase"]
X = df.drop(columns=["Purchase"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
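
The scaler above is fit on the training split only, which keeps test-set statistics out of the model. When cross-validation is run on data that was scaled beforehand, each validation fold has still contributed to the scaler's mean and variance. A minimal sketch of one way to avoid that, assuming a scikit-learn `Pipeline` is acceptable here, is shown below on synthetic data; `X_demo`, `y_demo`, and `pipe` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scaler is refit inside every CV fold, using that fold's training part only.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error")
print(-scores.mean())
```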

### 2. KNN Implementation / Parameter Tuning

In [8]:
from sklearn.neighbors import KNeighborsRegressor

In [10]:
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

In [11]:
from sklearn.metrics import mean_squared_error

In [12]:
preds = knn.predict(X_test)
mse = mean_squared_error(y_test, preds)
print(f"MSE: {mse}")

MSE: 25414501.474773332
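
The notebook uses scikit-learn's `KNeighborsRegressor`. For reference, the prediction rule behind it with default settings (uniform weights, Euclidean distance) can be sketched in a few lines of NumPy: find the k closest training points and average their targets. The function `knn_predict` and the synthetic arrays below are hypothetical and only for illustration; the sketch assumes NumPy arrays rather than pandas objects and is not optimized for large data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_train, y_train, X_query, k=5):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k smallest distances in each query row (unordered).
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Synthetic data for a quick comparison against scikit-learn.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 3))
y_tr = X_tr @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_q = rng.normal(size=(10, 3))

manual = knn_predict(X_tr, y_tr, X_q, k=5)
reference = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr).predict(X_q)
print(np.allclose(manual, reference))
```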


In [13]:
from sklearn.model_selection import GridSearchCV

In [27]:
knn = KNeighborsRegressor()

# Candidate values: k = 5, 10, ..., 500 and Minkowski p = 1 (Manhattan) or 2 (Euclidean).
params = {
    "n_neighbors": list(range(5, 501, 5)),
    "p": [1, 2],
}

In [28]:
grid_cv = GridSearchCV(knn, param_grid=params, cv=3)
grid_cv.fit(X_train, y_train)

In [29]:
grid_cv.best_params_

{'n_neighbors': 145, 'p': 1}
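
Since no `scoring` argument is passed, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R² for regressors, while the notebook reports MSE. A minimal sketch of selecting directly on MSE and inspecting how the score changes across the grid is given below. It runs on synthetic data with a much smaller grid than the one above; `X_demo`, `y_demo`, `search`, and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the real training split.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 5, 15, 45], "p": [1, 2]},
    cv=3,
    scoring="neg_mean_squared_error",  # select on (negated) MSE instead of R^2
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best candidate first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "param_p", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```

Plotting `mean_test_score` against `param_n_neighbors` from such a table is one way to check whether the chosen k sits on a clear optimum or on a flat plateau.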

In [30]:
knn = KNeighborsRegressor(**grid_cv.best_params_)
knn.fit(X_train, y_train)

### 3. Evaluation

In [31]:
preds = knn.predict(X_test)
mse = mean_squared_error(y_test, preds)
print(f"MSE: {mse}")

MSE: 22866519.225606058
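
MSE is expressed in squared units of `Purchase`, which makes the two figures hard to read directly. A short sketch that converts the two reported MSE values to RMSE, and computes the relative reduction from tuning, is shown below. It only re-uses the numbers printed above; the variable names are illustrative.

```python
import math

mse_default = 25414501.474773332  # reported for the default KNeighborsRegressor
mse_tuned = 22866519.225606058    # reported for n_neighbors=145, p=1

# RMSE is in the same units as Purchase.
rmse_default = math.sqrt(mse_default)
rmse_tuned = math.sqrt(mse_tuned)
mse_reduction = 1 - mse_tuned / mse_default

print(f"RMSE (default): {rmse_default:.1f}")
print(f"RMSE (tuned):   {rmse_tuned:.1f}")
print(f"Relative MSE reduction: {mse_reduction:.1%}")
```

On these numbers, the RMSE drops from roughly 5,040 to roughly 4,780, and tuning lowers the test MSE by about 10%.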
