# **Weekend Project (TPS)**
## Machine Learning Section (week 3)

### Group Members:
* Shaima Alharbi
* Shaikha AlBilais

# **Project Specifications**

## TPS Feb 2021
Starter Notebook

### Deliverables
1. EDA
    - What's going on?
    - Show me the data...
2. Model
    - Baseline...
    - Simple...
    - Evaluation...
    - Improvement...
3. RAPIDS Bonus
    - Apply RAPIDS ([Starter Notebook](https://www.kaggle.com/tunguz/tps-feb-2021-rapids-starter))
    - Replace pandas with cuDF & sklearn with cuML
    
    
#### Troubleshooting
- [Data](https://www.kaggle.com/c/tabular-playground-series-feb-2021/data)
- [Overview](https://www.kaggle.com/c/tabular-playground-series-feb-2021/overview)
- [RF Starter Notebook](https://www.kaggle.com/warobson/tps-feb-2021-rf-starter)
- [ML repo on GitHub](https://github.com/gumdropsteve/intro_to_machine_learning)
- [Most simple RAPIDS Notebook submission](https://www.kaggle.com/warobson/simple-rapids-live) (includes examples such as `train_test_split()` with cuML)

# **Libraries Importing**

In [None]:
import pandas as pd
import numpy as np
import cudf

import seaborn as sns
import matplotlib.pyplot as plt

from cuml.metrics import r2_score
from cuml.metrics import mean_squared_error
from cuml.ensemble import RandomForestRegressor

sns.set_palette('husl')

# **Data Loading**

In [None]:
train = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/train.csv")
test = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/test.csv")
sample_submission = cudf.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv')

In [None]:
train.tail(3)

In [None]:
test.tail(3)

In [None]:
sample_submission.tail(3)

# EDA

## Data Exploring

In [None]:
train.shape , test.shape 

In [None]:
train.columns

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
train.isna().sum()  # missing values per column

# Data Visualization

In [None]:
sns.pairplot(train.to_pandas().sample(500))

In [None]:
df_train_sample = train.to_pandas().sample(1000)  # visualize only 1000 samples

In [None]:
plt.figure(figsize=(27,25))
plt.subplot(4, 3 , 1)
df_train_sample.cat0.value_counts().plot.pie(explode= (0.05 , 0), autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14},
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT0');

plt.subplot(4 , 3 , 2)
df_train_sample.cat1.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT1');


plt.subplot(4 , 3 , 3)
df_train_sample.cat2.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT2');

plt.subplot(4 , 3 , 4)
df_train_sample.cat3.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT3');

plt.subplot(4, 3 , 5)
df_train_sample.cat4.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT4');

plt.subplot(4 , 3 , 6)
df_train_sample.cat5.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT5');

plt.subplot(4 , 3 , 7)
df_train_sample.cat6.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT6');

plt.subplot(4 , 3 , 8)
df_train_sample.cat7.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT7');

plt.subplot(4 , 3 , 9)
df_train_sample.cat8.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT8');


plt.subplot(4 , 3 , 10)
df_train_sample.cat9.value_counts().plot.pie(autopct='%1.1f%%', startangle=45 , textprops={'fontsize': 14} , 
                                  wedgeprops = {"edgecolor" : "black",'linewidth': 1,'antialiased': True}).set(title = 'CAT9')

### About the Above Pie Charts:
* `A` is the most frequent value in most of the categorical columns.
* CAT8 and CAT9 have the widest variety of values, especially CAT9.

In [None]:
sns.histplot(data=df_train_sample, x="target").set(title = 'Distribution of the target');

### About the Above Histogram:
* The distribution of the target peaks between 8 and 9.
* There appear to be some outliers near 4 and 5.

### Checking for Null Values

In [None]:
train.isna().sum()  # missing values per column

### Checking for Outliers

In [None]:
trainp = train.to_pandas()
fig=plt.figure(figsize=(25,11))
col=['id','target']
sns.boxplot(data=trainp.drop(columns=col,axis=1))
plt.title('Train Outliers Before Cleaning')
plt.show()

## Data Cleaning

* We want to see the difference after removing the outliers:

In [None]:
before = len(train)
print('Data length before removing the outliers = ', before)

In [None]:
train= train[(train['cont0']>train['cont0'].quantile(.05))&
      (train['cont2']>train['cont2'].quantile(.05))&
      (train['cont2']<train['cont2'].quantile(.95))&
      (train['cont6']<train['cont6'].quantile(.95))&
      (train['cont8']<train['cont8'].quantile(.95))&     
      (train['target']<train['target'].quantile(.95))&
      (train['target']>train['target'].quantile(.05))]
train

In [None]:
fig=plt.figure(figsize=(25,11))
col=['id','target']
sns.boxplot(data=train.to_pandas().drop(columns=col,axis=1))
plt.title('Train Outliers After Cleaning')
plt.show()

In [None]:
after = len(train) #after removing outliers 
print('Data length after removing the outliers = ', after)
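As a rough sanity check on how many rows the quantile filter should remove, the arithmetic can be sketched in plain Python. The dictionary `kept_per_column` and the independence assumption are illustrative and not part of the notebook; real columns may be correlated, so the true figure can differ.

```python
# Fraction of rows each column's conditions keep (read off the filter above).
kept_per_column = {
    "cont0": 0.95,   # > 5th percentile
    "cont2": 0.90,   # between the 5th and 95th percentiles
    "cont6": 0.95,   # < 95th percentile
    "cont8": 0.95,   # < 95th percentile
    "target": 0.90,  # between the 5th and 95th percentiles
}

# Worst case: every condition removes a different set of rows.
min_kept = 1.0 - sum(1.0 - k for k in kept_per_column.values())

# Best case: the removed rows overlap as much as possible. The two cuts on
# one column can never overlap, so roughly 90% of the rows remain at most.
max_kept = min(kept_per_column.values())

# Assumption: the columns are independent, so the kept fractions multiply.
expected_kept = 1.0
for k in kept_per_column.values():
    expected_kept *= k

print(f"kept between {min_kept:.0%} and {max_kept:.0%}, "
      f"about {expected_kept:.1%} if the columns were independent")
```

Comparing `after / before` from the cells above against this range is a quick way to confirm the filter behaved as intended.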

### About the Above Box Plot:
* The number of points flagged as outliers increased after cleaning, but we assume the remaining values now lie closer together because the selected outliers were removed.

## Data Splitting

In [None]:
train.to_pandas().select_dtypes('number').corr()  # correlations between numeric columns, to judge which features look useful

* Getting the dummies and splitting the data:

In [None]:
from cuml.model_selection import train_test_split  # current cuML location of train_test_split

X = train.drop('target', axis=1)
X = cudf.get_dummies(X)

y = train.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

## Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
col = X_train.columns  # keep the feature names; scaling returns plain NumPy arrays
X_train = scaler.fit_transform(X_train.to_pandas())  # fit on the training split only (pandas/CPU)
X_test = scaler.transform(X_test.to_pandas())  # reuse the training statistics for the test split
print('The scaled data are:')
X_train, X_test

In [None]:
# Switching between pandas & rapids (reattach the column names saved in `col`)
t = pd.DataFrame(X_train, columns=col)
X_train = cudf.DataFrame.from_pandas(t)

In [None]:
# Switching between pandas & rapids
t_test = pd.DataFrame(X_test, columns=col)
X_test = cudf.DataFrame.from_pandas(t_test)
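The scaling step above fits the scaler on the training split and only applies it to the test split. A minimal plain-Python sketch of that idea follows; `fit_scaler`, `transform`, and the tiny data set are hypothetical illustrations, not scikit-learn's implementation.

```python
import statistics

def fit_scaler(rows):
    """Learn per-column (mean, population std) from the training rows only."""
    columns = list(zip(*rows))
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]

def transform(rows, params):
    """Standardize rows with previously learned statistics.

    A zero standard deviation is replaced by 1.0 so constant columns
    do not cause a division by zero.
    """
    return [[(v - m) / (s or 1.0) for v, (m, s) in zip(row, params)]
            for row in rows]

train_rows = [[1.0, 10.0], [3.0, 30.0]]   # made-up training features
test_rows = [[2.0, 40.0]]                 # made-up test features

params = fit_scaler(train_rows)           # statistics come from train only
train_scaled = transform(train_rows, params)
test_scaled = transform(test_rows, params)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics means the test split is scaled exactly as unseen data would be, so no information from the test rows leaks into preprocessing.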

# **Data Modeling**

In [None]:
def baseline_model(n_preds, pred):
    return cudf.Series([pred for n in range(n_preds)])

## Baseline Model

In [None]:
baseline_preds = baseline_model(len(y_test), y_train.mean())
print('Baseline Predictions Are:')
baseline_preds

In [None]:
# squared=False returns the root mean squared error (RMSE)
bl_rmse = mean_squared_error(y_true=y_test, y_pred=baseline_preds, squared=False)
print('Baseline Root Mean Squared Error = ', bl_rmse)
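To make the baseline concrete, here is a small worked example in plain Python: predict the training mean for every test row and score it with the root mean squared error. The `rmse` helper and the target values are made up for illustration and are not the competition data.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_train = [7.0, 8.0, 9.0, 8.0]   # made-up training targets
y_test = [6.0, 8.0, 10.0]        # made-up test targets

mean_pred = statistics.fmean(y_train)        # constant prediction
baseline_preds = [mean_pred] * len(y_test)   # same value for every test row

baseline_rmse = rmse(y_test, baseline_preds)
print(mean_pred, round(baseline_rmse, 4))
```

Any real model should score below this constant-prediction error; otherwise it has learned nothing beyond the mean of the target.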

## Random Forest Regressor Model

In [None]:
for n in X_train.columns:
    X_train[n]=X_train[n].astype(np.float32)

In [None]:
rfr =RandomForestRegressor()

In [None]:
rfr.fit(X_train, y_train)
print('Fit completed.')

In [None]:
pred_rfr = rfr.predict(X_test)
print('Random Forest Regressor Predictions Are:')
pred_rfr

In [None]:
rfr_rmse =mean_squared_error(y_true=y_test.astype(np.float64),
                   y_pred=pred_rfr.astype(np.float64),
                   squared=False)
print('Random Forest Regressor Root Mean Squared Error = ', rfr_rmse)

# Data Optimization

In [None]:
# Switching Between pandas and rapids
X_train_pandas=X_train.to_pandas()
y_train_pandas=y_train.to_pandas()

# X_train_tcd=cudf.DataFrame.from_pandas(t)

## GridSearchCV

In [None]:
# from sklearn.ensemble import RandomForestRegressor
# rfr  = RandomForestRegressor()

# from sklearn.model_selection import GridSearchCV
# p_grid = {'max_features': ['auto', 'sqrt', 'log2']}
# grid = GridSearchCV(rfr, p_grid,cv=10)

In [None]:
# grid.fit(X_train_pandas, y_train_pandas)
# print('Fit completed.')

In [None]:
# best = grid.best_params_
# print('The best parameters for the model are:', best)

# **Model Selection**

* The error decreased after we removed the outliers and scaled the data. The values below are root mean squared errors (RMSE), since the metric is computed with `squared=False`.
* We also tried different parameters for the Random Forest Regressor model to get the best result.
* RMSE before: around 0.7351
* RMSE after: around 0.7322

However,
* We tried GridSearchCV to further reduce the error of the Random Forest Regressor model.
* Fitting the GridSearchCV took too long and kept the notebook's CPU fully busy.
* So we kept the last result obtained with the Random Forest Regressor model and the parameters we had already used.
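To put this trade-off in numbers, the sketch below counts how many model fits the commented-out grid search would have needed and how large the improvement between the two reported error values is. It assumes scikit-learn's default `refit=True` (one extra fit on the full training set); the variable names are illustrative.

```python
from itertools import product

# Grid and fold count taken from the commented-out GridSearchCV cell.
p_grid = {'max_features': ['auto', 'sqrt', 'log2']}
cv = 10

# Every combination of parameter values is one candidate.
candidates = list(product(*p_grid.values()))

# Each candidate is fitted once per fold, plus one final refit
# (assumption: the default refit=True is in effect).
n_fits = len(candidates) * cv + 1

# Relative improvement between the two error values reported above.
error_before, error_after = 0.7351, 0.7322
rel_improvement = (error_before - error_after) / error_before

print(f"{len(candidates)} candidates x {cv} folds + 1 refit = {n_fits} fits")
print(f"relative improvement: {rel_improvement:.2%}")
```

With this many full random-forest fits on the CPU, a smaller `cv` value or a subsample of the training rows would be a cheaper way to compare the same candidates.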

# Submitting the Results to Kaggle

In [None]:
%%time

# data load
train = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/train.csv")
test = cudf.read_csv("/kaggle/input/tabular-playground-series-feb-2021/test.csv")
sample_submission = cudf.read_csv('../input/tabular-playground-series-feb-2021/sample_submission.csv')

# data prep
X = train.drop('target', axis=1)
X = cudf.get_dummies(X)

y = train.target

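test = cudf.get_dummies(test)
test['cat6_G'] = 0  # test has no 'G' in cat6, so add the missing dummy column
test = test[list(X.columns)]  # same column order as the training features
# cast features to float32 for the cuML random forest
for n in X.columns:
    X[n] = X[n].astype(np.float32)
    test[n] = test[n].astype(np.float32)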
# modeling
rfr = RandomForestRegressor()
rfr.fit(X,y)
 
rf_preds =rfr.predict(test)

# save results & submit
sample_submission['target'] = rf_preds

sample_submission.to_csv('submission.csv', index=False)
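The `cat6_G` fix above is needed because one-hot encoding the train and test sets separately can produce different columns. One way to avoid patching columns by hand is to learn the category list from the training data and reuse it for both sets. The following is a plain-Python sketch of that idea with a hypothetical `one_hot` helper and made-up values, not a cuDF API.

```python
def one_hot(values, categories, prefix="cat6"):
    """Encode each value as 0/1 indicator columns for a fixed category list."""
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

train_cat6 = ["A", "B", "G", "A"]   # made-up training values
test_cat6 = ["B", "A", "A"]         # made-up test values, no "G"

# Learn the categories from the training data only, in a stable order.
categories = sorted(set(train_cat6))

train_encoded = one_hot(train_cat6, categories)
test_encoded = one_hot(test_cat6, categories)

# Both sets now share the same columns in the same order.
print(list(train_encoded[0]))
print(list(test_encoded[0]))
```

Because the column list is fixed up front, a category missing from the test set simply becomes an all-zero column instead of a missing one.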