# Step 1 - Download the Dataset
Download the dataset from the following link:
https://www.kaggle.com/felixzhao/productdemandforecasting

# Step 2 - Read the Dataset
Read the dataset into a pandas DataFrame.
Does the dataset include any missing values? If so, drop them.
Hint: Pandas can do that with one line of code!

In [79]:
import pandas as pd 
df = pd.read_csv("Historical Product Demand.csv").dropna()
df.head(5)

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand
0,Product_0993,Whse_J,Category_028,2012/7/27,100
1,Product_0979,Whse_J,Category_028,2012/1/19,500
2,Product_0979,Whse_J,Category_028,2012/2/3,500
3,Product_0979,Whse_J,Category_028,2012/2/9,500
4,Product_0979,Whse_J,Category_028,2012/3/2,500
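Before dropping rows, it can help to see how many values are actually missing. The following is a minimal sketch on a small hand-made frame; the `toy` data and its values are illustrative assumptions, not rows taken from the real CSV.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV (values are illustrative only).
toy = pd.DataFrame({
    "Product_Code": ["Product_0993", "Product_0979", "Product_0979"],
    "Date": ["2012/7/27", np.nan, "2012/2/3"],
    "Order_Demand": ["100", "500", "500"],
})

missing_per_column = toy.isna().sum()   # NaN count for each column
rows_before = len(toy)
toy_clean = toy.dropna()                # drop every row that contains a NaN
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

Running the same two checks on the real DataFrame before calling `dropna()` shows which columns are responsible for the missing rows.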


# Step 3 - Extract Features
Exclude the region and date from the considered features.
Hint: You can choose to use all the features.

In [80]:
# Description of the data
df.describe()

Unnamed: 0,Product_Code,Warehouse,Product_Category,Date,Order_Demand
count,1037336,1037336,1037336,1037336,1037336
unique,2160,4,33,1729,3749
top,Product_1359,Whse_J,Category_019,2013/9/27,1000
freq,16936,764447,470266,2075,112263


In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1037336 entries, 0 to 1048574
Data columns (total 5 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   Product_Code      1037336 non-null  object
 1   Warehouse         1037336 non-null  object
 2   Product_Category  1037336 non-null  object
 3   Date              1037336 non-null  object
 4   Order_Demand      1037336 non-null  object
dtypes: object(5)
memory usage: 47.5+ MB


In [82]:
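# Extract the target and drop it from df
target_feature = ['Order_Demand']
# Order_Demand is stored as text (dtype object above), so convert it to numbers.
# Assumption: any parentheses around a value are stripped and the magnitude kept.
y = df[target_feature].apply(
    lambda col: pd.to_numeric(col.astype(str).str.strip().str.strip('()'))
)

# Drop Order_Demand from the features
df.drop(target_feature, axis=1, inplace=True)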

# Step 4 - Perform Preprocessing
Perform any needed pre-processing on the chosen features including:
Scaling.
Encoding.
Dealing with NaN values.
Hint:
Use only the preprocessing steps you think are useful.

In [83]:
# Count the columns that contain nulls and the percentage of rows with a missing value
print("Number of attributes with null values: ", df.isnull().any().sum())
print("Percentage of missing values: ", df.isnull().any(axis=1).sum() / len(df) * 100)

Number of attributes with null values:  0
Percentage of missing values:  0.0


In [84]:
#feature engineering, convert date to year / month
df['Date'] = pd.to_datetime(df['Date'])

#convert to year and month
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month

#drop date
df.drop('Date', axis=1, inplace=True)

In [85]:
from sklearn.preprocessing import LabelEncoder

# Label encoding for 'Product_Code'
label_encoder = LabelEncoder()
df['Product_Code'] = label_encoder.fit_transform(df['Product_Code'])

# One-hot encoding for 'Warehouse', 'Product_Category' and 'month'
df = pd.get_dummies(df, columns=['Warehouse', 'Product_Category', 'month'])
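To make the difference between the two encodings concrete, here is a small sketch on a hand-made frame. The `toy` rows are illustrative assumptions; only the column names follow the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only; the column names mirror the dataset.
toy = pd.DataFrame({
    "Product_Code": ["Product_0993", "Product_0979", "Product_0979"],
    "Warehouse": ["Whse_J", "Whse_A", "Whse_J"],
})

# LabelEncoder maps each distinct string to an integer, in sorted order.
codes = LabelEncoder().fit_transform(toy["Product_Code"])

# get_dummies replaces the column with one indicator column per level.
encoded = pd.get_dummies(toy, columns=["Warehouse"])

print(list(codes))             # [1, 0, 0]
print(list(encoded.columns))   # ['Product_Code', 'Warehouse_Whse_A', 'Warehouse_Whse_J']
```

Label encoding keeps a single column but imposes an arbitrary numeric order on the codes, which a distance-based model such as KNN will treat as meaningful. One-hot encoding avoids that at the cost of one extra column per level.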

In [86]:
X = df

# Step 5 - Split the Data
Split your data as follows:
80% training set
10% validation set
10% test set

In [87]:
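from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80% train, then split the remaining 20% evenly into validation and test (10% each)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2)
X_validate, X_test, y_validate, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5)

# Fit the scaler on the training set only, so validation/test statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_validate = scaler.transform(X_validate)
X_test = scaler.transform(X_test)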
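The two-stage split can be sanity-checked on synthetic data. The sketch below uses made-up arrays (`X_demo`, `y_demo`) and a fixed `random_state`, both assumptions added for illustration, to confirm that holding out 20% and then halving it gives an 80/10/10 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Stage 1: hold out 20% of the samples.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Stage 2: split the held-out 20% in half (10% validation, 10% test).
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))  # 80 10 10
```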

# Step 6 - Training K-Nearest Neighbor (KNN) Regression
Use a KNN regressor model to train your data.
Choose the best k for the KNN algorithm by trying different values and validating performance on the validation set.
Regression Metrics
Print the R-squared score of your final KNN regressor.

In [90]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

scores = []
best_score = float("-inf")  # R-squared can be negative
neighbours = range(2, 7)  # candidate k values; widen the range as compute time allows

for k in neighbours:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)

    # Select k on the validation set; the test set stays untouched until the end
    result = knn.score(X_validate, y_validate)
    scores.append(round(result, 2))

    if result > best_score:
        best_score = result
        best_k = k
        best_model = knn

# Report R-squared of the chosen model on the held-out test set
y_pred = best_model.predict(X_test)
best_r2 = r2_score(y_test, y_pred)

print(scores)
print("Best validation score: ", best_score)
print("Best k:", best_k)
print("Best r2: ", best_r2)

KeyboardInterrupt: 
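For reference, the R-squared score requested in this step is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers (`y_true` and `y_hat` are illustrative assumptions) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([100.0, 500.0, 500.0, 1000.0])
y_hat = np.array([150.0, 450.0, 500.0, 900.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, 0 matches always predicting the mean, and negative values are possible when the model does worse than the mean.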

# Step 7 - Challenge Yourself (Optional)
Repeat step 6 for a different regression modelling technique.
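One possible starting point is a linear model such as ridge regression. The sketch below is only an assumption about how this step might be approached: it runs on synthetic data from `make_regression` rather than the demand dataset, and the choice of `Ridge` and its `alpha` value is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real features and target.
X_syn, y_syn = make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# alpha controls the strength of the L2 penalty; 1.0 is scikit-learn's default.
ridge = Ridge(alpha=1.0).fit(X_fit, y_fit)
ridge_r2 = r2_score(y_eval, ridge.predict(X_eval))

print("Ridge R-squared on held-out data:", ridge_r2)
```

With the real data, the same validation-then-test procedure from step 6 would apply, with `alpha` taking the place of `k` as the value to tune.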