# Lab 3 - Regression Tree and Kaggle competition

In this lab you will participate in a Kaggle competition:

https://www.kaggle.com/t/1552d52d5bbc474da5e024b385b9fe9e

In this competition you will train and test a **decision tree regressor**, choosing how to preprocess the data and which model parameters to use.

First, click the link and join the competition.

Using the 'kaggle.json' file from the first lab, load the competition's data:

(Remember to upload it to your environment.)

You should have 3 files:


```
sample_submission.csv - an example of the submission file format
test_set.csv
train_set.csv
```



In [1]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c decision-trees-lab-3-2023
! unzip decision-trees-lab-3-2023.zip


Downloading decision-trees-lab-3-2023.zip to /content
  0% 0.00/4.97M [00:00<?, ?B/s]
100% 4.97M/4.97M [00:00<00:00, 154MB/s]
Archive:  decision-trees-lab-3-2023.zip
  inflating: sample_submission.csv   
  inflating: test_set.csv            
  inflating: train_set.csv           


## Task 1:
Load the train set and decide how to preprocess it.

Think about missing values, redundant features, categorical values, normalization, etc.



In [2]:
import pandas as pd

train_set = pd.read_csv('/content/train_set.csv', index_col='index')
test_set = pd.read_csv('/content/test_set.csv', index_col='index')
train_set.head(10)

index,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
179791,Vistara,UK-876,Hyderabad,Night,one,Morning,Chennai,Economy,14.25,41,6287
94264,GO_FIRST,G8-116,Bangalore,Early_Morning,one,Night,Mumbai,Economy,13.83,3,5177
140559,Vistara,UK-738,Kolkata,Evening,one,Night,Bangalore,Economy,25.67,36,8111
213532,Air_India,AI-665,Delhi,Early_Morning,one,Morning,Bangalore,Business,27.17,21,42457
6230,Indigo,6E-2097,Delhi,Early_Morning,one,Afternoon,Mumbai,Economy,7.58,32,2410
24656,Indigo,6E-2043,Delhi,Afternoon,one,Evening,Kolkata,Economy,4.17,27,4257
208126,Vistara,UK-879,Delhi,Evening,one,Night,Mumbai,Business,5.33,15,45156
43333,Vistara,UK-771,Mumbai,Early_Morning,one,Morning,Delhi,Economy,26.58,3,14028
189185,Air_India,AI-430,Chennai,Morning,one,Evening,Mumbai,Economy,8.08,15,10643
199118,Vistara,UK-822,Chennai,Morning,one,Night,Kolkata,Economy,13.0,15,15384


In [3]:
train_set.shape, test_set.shape

((210107, 11), (90046, 10))

In [4]:
train_set.isna().sum()

airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

In [5]:
train_set[train_set.duplicated()]

index,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price


In [6]:
from sklearn.preprocessing import LabelEncoder


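# Encode every string column except 'stops', which is ordinal and is mapped by hand below.
cat_features = list(train_set.select_dtypes(include=['object']).columns)
cat_features.remove('stops')

# Fit each encoder on train and test together so every label in either set gets a code.
for feature in cat_features:
    label_encoder = LabelEncoder().fit(pd.concat([train_set[feature], test_set[feature]]))
    train_set[feature] = label_encoder.transform(train_set[feature])
    test_set[feature] = label_encoder.transform(test_set[feature])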
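
The cell above fits each encoder on the train and test columns together, so no test label is ever unseen. A possible alternative, sketched here on toy data rather than the lab's files, is to fit on the training column only and send unseen categories to a sentinel code. The frames `toy_train` and `toy_test` and the sentinel `-1` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the train and test frames (values are illustrative only).
toy_train = pd.DataFrame({"airline": ["Vistara", "Indigo", "Air_India"]})
toy_test = pd.DataFrame({"airline": ["Indigo", "SpiceJet"]})  # "SpiceJet" is unseen

# Fit on the training categories only; unseen test categories map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train[["airline"]])

# Categories are sorted alphabetically before codes are assigned.
train_codes = encoder.transform(toy_train[["airline"]]).ravel()
test_codes = encoder.transform(toy_test[["airline"]]).ravel()
print(train_codes, test_codes)
```

Either way, the integer codes carry an arbitrary order, which a tree can still split on but which has no meaning of its own.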

In [7]:
X = train_set.drop('price', axis=1)
y = train_set['price']

In [8]:
X['stops'].unique()

array(['one', 'two_or_more', 'zero'], dtype=object)

In [9]:
nums = {
    'zero': 0,
    'one': 1,
    'two_or_more': 2,
}
X['stops'] = X['stops'].map(nums)
X['stops'].unique()

array([1, 2, 0])

In [10]:
test_set['stops'] = test_set['stops'].map({'zero': 0, 'one': 1, 'two_or_more': 2})

In [11]:
X['stops_duration_product'] = X['stops'] * X['duration']
test_set['stops_duration_product'] = test_set['stops'] * test_set['duration']

In [12]:
X['stops_duration_ratio'] = X['stops'] / X['duration']
test_set['stops_duration_ratio'] = test_set['stops'] / test_set['duration']

In [13]:
from sklearn.preprocessing import StandardScaler

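# After encoding, every column is numeric, so this selects all features.
num_features = list(X.select_dtypes(include=['int64', 'float64']).columns)
scaler = StandardScaler()
# The scaler is fit on train and test features together.
scaler.fit(pd.concat([X, test_set])[num_features])

# Note: the rebuilt frames get a fresh 0..n-1 index; row order is unchanged.
X = pd.DataFrame(scaler.transform(X[num_features]), columns=num_features)
test_set = pd.DataFrame(scaler.transform(test_set[num_features]), columns=num_features)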
X.head()

,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,stops_duration_product,stops_duration_ratio
0,1.033746,1.01165,0.241134,1.472212,0.190121,0.531627,-0.910442,0.672576,0.282117,1.105764,0.18133,-0.272138
1,-0.602682,-0.371085,-1.471431,-0.807934,0.190121,1.10579,1.382425,0.672576,0.223718,-1.696393,0.134139,-0.235095
2,1.033746,0.845253,0.811988,-0.237897,0.190121,1.10579,-1.483659,0.672576,1.869995,0.737059,1.464477,-0.814794
3,-1.148157,-0.647632,-0.329721,-0.807934,0.190121,0.531627,-1.483659,-1.486822,2.078561,-0.369055,1.633017,-0.852177
4,-0.057206,-2.269416,-0.329721,-0.807934,0.190121,-1.765028,1.382425,0.672576,-0.645305,0.442096,-0.568109,0.801211
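
On the normalization question from Task 1: a decision tree splits on thresholds, so rescaling a feature with a positive factor should not change which rows fall on each side of a split. The sketch below checks this on synthetic data (not the lab's data); all names in it are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X_raw = rng.uniform(0, 1, size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_toy = X_raw[:, 0] + 0.01 * X_raw[:, 1] + rng.normal(0, 0.1, size=200)

X_scaled = StandardScaler().fit_transform(X_raw)

# Same hyperparameters and seed; only the feature scaling differs.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y_toy)

same = np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print("Identical predictions:", same)
```

If the two trees agree, scaling is harmless for this model but also not required; it matters more for distance-based or gradient-based models.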


## Task 2:
Engineer features by combining 2 or more features in a nonlinear way. For example, multiply the number of stops by the duration.

Create 2 new features by combining existing features. Add those features to your feature set.
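
A minimal sketch of two such combined features on a toy frame. The column names follow the lab's data, the values are made up, and the zero-duration guard is an assumption about how one might want to handle division by zero.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as the lab's data (values are made up).
toy = pd.DataFrame({"stops": [0, 1, 2], "duration": [2.0, 0.0, 10.0]})

# Product feature: grows with both the number of stops and the flight length.
toy["stops_duration_product"] = toy["stops"] * toy["duration"]

# Ratio feature: replace a zero duration with NaN first so the division
# yields NaN instead of inf, then fill with 0 (an arbitrary choice).
safe_duration = toy["duration"].replace(0, np.nan)
toy["stops_duration_ratio"] = (toy["stops"] / safe_duration).fillna(0.0)
print(toy)
```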

## Task 3:
Create a decision tree for regression using the sklearn.tree.DecisionTreeRegressor class.

Select several parameters (such as max_depth, ccp_alpha, [etc.](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)) and tune them using [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Train the tree on the train set data.
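
By default, GridSearchCV ranks candidates with the estimator's own `score` method, which is R² for a regressor. If the competition happens to be scored on RMSE (an assumption; check the competition page), the search can be pointed at that metric instead. A sketch on synthetic data, with a hypothetical parameter grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))  # synthetic features
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=300)

# Hypothetical grid; the values are for illustration only.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # negated, so higher is better
)
search.fit(X_toy, y_toy)
print(search.best_params_, -search.best_score_)
```

Fixing `random_state` on the regressor makes repeated runs of the search comparable.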

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [15]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

regressor = DecisionTreeRegressor()

In [29]:
params = {
    'max_depth': [3, 4, 5, 6, 7, 10, 15, 20, 30],
    'ccp_alpha': [0.0, 0.4, 0.8]
}
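
Useful `ccp_alpha` values depend on the scale of the target, because the parameter is compared against squared-error impurity. One way to see which values matter for a given dataset is `cost_complexity_pruning_path`, sketched here on synthetic data with a price-like target scale (an assumption for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
# Target in the thousands, loosely mimicking a price column.
y_toy = 1000.0 * X_toy[:, 0] + rng.normal(0, 500.0, size=200)

# Effective alphas at which subtrees of the fully grown tree get pruned.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_toy, y_toy)
ccp_alphas = path.ccp_alphas
print(len(ccp_alphas), ccp_alphas.min(), ccp_alphas.max())

# Pick a few candidates spread over the path; clip guards against tiny
# negative values that floating-point error can produce.
candidate_alphas = np.clip(np.quantile(ccp_alphas, [0.5, 0.9, 0.99]), 0, None)
print(candidate_alphas)
```

The resulting candidates could then replace the hand-picked values in the grid.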

In [30]:
grid_search = GridSearchCV(regressor, params, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters GS:", grid_search.best_params_)
best_dt_gs = grid_search.best_estimator_

Best parameters GS: {'ccp_alpha': 0.4, 'max_depth': 20}


In [31]:
import numpy as np

scores_gs = cross_val_score(best_dt_gs, X_train, y_train, cv=5)

print(f"Mean cross-validation scores: GS - {np.mean(scores_gs)}")
print(f"Standard deviation of cross-validation scores: GS - {np.std(scores_gs)}")

Mean cross-validation scores: GS - 0.9812468718290646
Standard deviation of cross-validation scores: GS - 0.00036679367157285876
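
The split in cell 14 also sets aside a hold-out part (`X_test`, `y_test`) that the cells shown here do not score. A sketch of such a check, on synthetic data with hypothetical names, reporting both RMSE and R²:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1.0, size=500)

# Same split proportions as the lab: 10% held out.
X_tr, X_ho, y_tr, y_ho = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)
model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Score on rows the model never saw during fitting.
y_ho_pred = model.predict(X_ho)
rmse = np.sqrt(mean_squared_error(y_ho, y_ho_pred))
r2 = r2_score(y_ho, y_ho_pred)
print(f"hold-out RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```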


In [33]:
dec_tree = best_dt_gs

## Task 4:
Use the trained model to predict on the test set.

Plot the selected tree using tree.plot_tree.

Save the output predictions to a CSV file to submit to the Kaggle competition. Name the file 'sample_submission'; an example is provided in the file 'sample_submission.csv'.

Make sure to keep the ids the same as they are in "sample_submission.csv".
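
A quick way to check the ids before uploading is to compare the submission's index with the sample file's. The sketch below uses an in-memory stand-in for 'sample_submission.csv'; the `index` and `price` column names are assumed from the outputs in this notebook, and the ids and predictions are placeholders.

```python
import io
import pandas as pd

# In-memory stand-in for sample_submission.csv (ids are made up).
sample_csv = io.StringIO("index,price\n11,0\n12,0\n13,0\n")
sample = pd.read_csv(sample_csv, index_col="index")

predictions = [2500.0, 61000.0, 4500.0]  # placeholder predictions
submission = pd.DataFrame({"price": predictions}, index=sample.index)

# Sanity checks: same ids in the same order, same columns, no gaps.
assert submission.index.equals(sample.index), "ids differ from the sample file"
assert list(submission.columns) == list(sample.columns), "columns differ"
assert not submission["price"].isna().any(), "missing predictions"
print(submission)
```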

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Pass a plain list: some scikit-learn versions reject a pandas Index here.
plot_tree(dec_tree, feature_names=list(X_train.columns), filled=True)
plt.show()
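
A tree as deep as the one selected above can be slow to draw and hard to read in full. Both `plot_tree` and `export_text` accept a `max_depth` argument that limits how many levels are shown. A sketch with `export_text` on synthetic data (the feature names are borrowed from the lab for readability, the data is not):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 0.5, size=300)

deep_tree = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_toy, y_toy)

# Render only the top levels of the tree as indented text.
summary = export_text(deep_tree, feature_names=["duration", "days_left"], max_depth=2)
print(summary)
```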

In [32]:
test_set.head()

,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,stops_duration_product,stops_duration_ratio
0,-1.693633,0.294503,0.241134,1.472212,0.190121,-0.042537,-0.337225,0.672576,-1.10832,0.663318,-0.942267,2.597952
1,1.033746,1.079615,-0.329721,0.902176,0.190121,1.10579,-0.910442,-1.486822,0.282117,1.621951,0.18133,-0.272138
2,1.033746,0.882751,-1.471431,-0.807934,-2.321779,0.531627,-0.337225,0.672576,-1.328009,-0.369055,-1.419796,-1.491927
3,-1.148157,-0.872619,-0.900576,0.902176,0.190121,-0.616701,1.382425,-1.486822,-0.575783,0.294614,-0.511929,0.659309
4,-0.057206,-2.496747,0.241134,-0.807934,-2.321779,-1.190865,-0.910442,0.672576,-1.525451,1.105764,-1.419796,-1.491927


In [36]:
y_pred = dec_tree.predict(test_set)
len(np.unique(y_pred, return_counts=True)[0])

10329

In [37]:
len(np.unique(y_train, return_counts=True)[0])

10684

In [39]:
index = pd.read_csv('/content/test_set.csv', index_col='index').index
results_df = pd.DataFrame({'price': y_pred})
results_df.index = index
results_df.head()

index,price
156589,2851.0
224900,60232.0
87968,4506.466403
292192,49553.0
179722,1553.265306


In [40]:
results_df.to_csv('sample_submission.csv', index=True)

## Task 5:
Convert the notebook to HTML:


1.   Download the notebook from the File menu
2.   Upload it to this environment
3.   Run the command:


In [None]:
! jupyter nbconvert --to html Lab3.ipynb