## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy it and learn as much as we did preparing this competition.


## Introduction

The goal of this competition is to predict the `Severity` of a car crash from information about the crash, e.g., its location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use Python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions!

### Checklist
You can participate in this competition the way you prefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can submit up to 5 submissions per day. You can select only one of the submissions you make to be considered in the final ranking.*


Don't hesitate to leave a comment or contact me if you have any questions!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os

## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [2]:
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
dw = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'))
print("The shape of the dataset is {}.\n\n".format(df.shape))

df.head()

The shape of the dataset is (6407, 16).




Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Judging by the column names and a sample of the data, the features are of numerical and categorical types. What about some descriptive statistics?

In [3]:
df.drop(columns='ID').describe()

Unnamed: 0,Lat,Lng,Distance(mi),Severity
count,6407.0,6407.0,6407.0,6407.0
mean,37.765653,-122.40599,0.135189,2.293429
std,0.032555,0.028275,0.39636,0.521225
min,37.609619,-122.51044,0.0,1.0
25%,37.737096,-122.41221,0.0,2.0
50%,37.768238,-122.404835,0.0,2.0
75%,37.787813,-122.392477,0.041,3.0
max,37.825626,-122.349734,6.82,4.0


The output shows descriptive statistics for the numerical features `Lat`, `Lng`, and `Distance(mi)`, and for the target `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**
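
`describe()` only covers numeric columns by default. As a minimal sketch on made-up toy data (the column names mirror `train.csv`, the values do not), the non-numeric columns and the class balance of the target could be inspected like this:

```python
import pandas as pd

# Toy frame mimicking a few columns of train.csv (values are made up).
toy = pd.DataFrame({
    "Distance(mi)": [0.044, 0.0, 0.0, 0.009, 0.0],
    "Crossing": [False, False, False, True, False],
    "Side": ["R", "R", "L", "R", "R"],
    "Severity": [2, 2, 3, 1, 2],
})

# Summarize everything that is NOT numeric (boolean and text columns).
non_numeric_summary = toy.describe(exclude=["number"])
print(non_numeric_summary)

# Share of each target class, useful for spotting class imbalance.
class_share = toy["Severity"].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same two calls on `df` would show how often each road feature is `True` and how dominant the most common `Severity` class is.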

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely the training, validation, and test sets. In our case, the test set is already predefined, so we'll split the "training" set into training and validation sets with a 0.8:0.2 ratio.

*Note: a good way to get reproducible results is to set the seed of any algorithm that depends on randomization. This is done with the `random_state` argument in the following command.*
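
As a hedged illustration of the note above, the sketch below splits a made-up, imbalanced toy dataset with both `random_state` and `stratify`. The toy data is an assumption for demonstration only; in the notebook the same arguments would be passed when splitting the real `DataFrame`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with an imbalanced target (made-up values).
toy = pd.DataFrame({
    "Distance(mi)": [i / 100 for i in range(100)],
    "Severity": [2] * 70 + [3] * 20 + [1] * 10,
})

# stratify keeps the class proportions the same in both subsets;
# random_state makes the split reproducible.
train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Severity"]
)

print(train_part["Severity"].value_counts(normalize=True).sort_index())
print(val_part["Severity"].value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare class can end up under-represented in the validation set purely by chance, which makes the validation accuracy noisier.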

In [4]:
# from sklearn.model_selection import train_test_split

# train_df, val_df = train_test_split(df, test_size=0.2, random_state=42) # Try adding `stratify` here

# X_train = train_df.drop(columns=['ID', 'Severity'])
# y_train = train_df['Severity']

# X_val = val_df.drop(columns=['ID', 'Severity'])
# y_val = val_df['Severity']
# print(X_train)

In [5]:
df[['Year','Month','Day']] = df['timestamp'].astype(str).str.split('-',expand=True)
df[['Day','Hour']] = df['Day'].str.split(' ',expand=True)
df[['Hour','x','y']] = df['Hour'].str.split(':',expand=True)



df

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Amenity,Side,Severity,timestamp,Year,Month,Day,Hour,x,y
0,0,37.762150,-122.405660,False,0.044,False,False,False,False,False,...,True,R,2,2016-03-25 15:13:02,2016,03,25,15,13,02
1,1,37.719157,-122.448254,False,0.000,False,False,False,False,False,...,False,R,2,2020-05-05 19:23:00,2020,05,05,19,23,00
2,2,37.808498,-122.366852,False,0.000,False,False,False,False,False,...,False,R,3,2016-09-16 19:57:16,2016,09,16,19,57,16
3,3,37.785930,-122.391080,False,0.009,False,False,True,False,False,...,False,R,1,2020-03-29 19:48:43,2020,03,29,19,48,43
4,4,37.719141,-122.448457,False,0.000,False,False,False,False,False,...,False,R,2,2019-10-09 08:47:00,2019,10,09,08,47,00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,False,0.368,False,False,False,False,False,...,False,R,3,2017-10-01 18:36:13,2017,10,01,18,36,13
6403,6403,37.752755,-122.402790,False,0.639,False,False,True,False,False,...,False,R,2,2018-10-23 07:40:27,2018,10,23,07,40,27
6404,6404,37.726304,-122.446015,False,0.000,False,False,True,False,False,...,False,R,2,2019-10-28 15:45:00,2019,10,28,15,45,00
6405,6405,37.808090,-122.367211,False,0.000,False,False,True,False,False,...,False,R,3,2019-05-04 13:45:31,2019,05,04,13,45,31
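
The string splitting in the cell above can also be expressed with pandas' datetime parsing. This is an alternative sketch, not the approach used in this notebook; it runs on two timestamps copied from the sample rows and adds a `Weekday` column that is purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": ["2016-03-25 15:13:02", "2020-05-05 19:23:00"],
})

# Parse once, then read the parts through the .dt accessor.
parsed = pd.to_datetime(toy["timestamp"])
toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month
toy["Day"] = parsed.dt.day
toy["Hour"] = parsed.dt.hour
# Extra calendar feature that string splitting does not give directly.
toy["Weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

The resulting columns are already integers, so no later `astype('int')` conversion is needed before merging on them.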


In [6]:
del df['timestamp']
del df['x']
del df['y']
del df['Bump']
del df['Give_Way']
del df['No_Exit']
del df["Railway"]

df


Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,Severity,Year,Month,Day,Hour
0,0,37.762150,-122.405660,0.044,False,False,False,False,True,R,2,2016,03,25,15
1,1,37.719157,-122.448254,0.000,False,False,False,False,False,R,2,2020,05,05,19
2,2,37.808498,-122.366852,0.000,False,False,False,True,False,R,3,2016,09,16,19
3,3,37.785930,-122.391080,0.009,False,True,False,False,False,R,1,2020,03,29,19
4,4,37.719141,-122.448457,0.000,False,False,False,False,False,R,2,2019,10,09,08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,False,False,False,False,False,R,3,2017,10,01,18
6403,6403,37.752755,-122.402790,0.639,False,True,False,False,False,R,2,2018,10,23,07
6404,6404,37.726304,-122.446015,0.000,False,True,False,False,False,R,2,2019,10,28,15
6405,6405,37.808090,-122.367211,0.000,False,True,False,False,False,R,3,2019,05,04,13


In [7]:
df['Crossing']=df['Crossing'].astype(int)
df['Junction']=df['Junction'].astype(int)
# df['Railway']=df['Railway'].astype(int)
df['Roundabout']=df['Roundabout'].astype(int)
df['Stop']=df['Stop'].astype(int)
df['Amenity']=df['Amenity'].astype(int)
df

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,Severity,Year,Month,Day,Hour
0,0,37.762150,-122.405660,0.044,0,0,0,0,1,R,2,2016,03,25,15
1,1,37.719157,-122.448254,0.000,0,0,0,0,0,R,2,2020,05,05,19
2,2,37.808498,-122.366852,0.000,0,0,0,1,0,R,3,2016,09,16,19
3,3,37.785930,-122.391080,0.009,0,1,0,0,0,R,1,2020,03,29,19
4,4,37.719141,-122.448457,0.000,0,0,0,0,0,R,2,2019,10,09,08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,0,0,0,0,0,R,3,2017,10,01,18
6403,6403,37.752755,-122.402790,0.639,0,1,0,0,0,R,2,2018,10,23,07
6404,6404,37.726304,-122.446015,0.000,0,1,0,0,0,R,2,2019,10,28,15
6405,6405,37.808090,-122.367211,0.000,0,1,0,0,0,R,3,2019,05,04,13


In [8]:



df['Side'] = df['Side'].astype('category')
cat_columns = df.select_dtypes(['category']).columns
print(cat_columns)
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
df

Index(['Side'], dtype='object')


Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,Severity,Year,Month,Day,Hour
0,0,37.762150,-122.405660,0.044,0,0,0,0,1,1,2,2016,03,25,15
1,1,37.719157,-122.448254,0.000,0,0,0,0,0,1,2,2020,05,05,19
2,2,37.808498,-122.366852,0.000,0,0,0,1,0,1,3,2016,09,16,19
3,3,37.785930,-122.391080,0.009,0,1,0,0,0,1,1,2020,03,29,19
4,4,37.719141,-122.448457,0.000,0,0,0,0,0,1,2,2019,10,09,08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,0,0,0,0,0,1,3,2017,10,01,18
6403,6403,37.752755,-122.402790,0.639,0,1,0,0,0,1,2,2018,10,23,07
6404,6404,37.726304,-122.446015,0.000,0,1,0,0,0,1,2,2019,10,28,15
6405,6405,37.808090,-122.367211,0.000,0,1,0,0,0,1,3,2019,05,04,13


In [9]:




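# numeric_only=True skips the string columns (Year, Month, Day, Hour); recent pandas raises otherwise
df.fillna(df.mean(numeric_only=True),inplace=True)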
df.isnull().sum()

ID              0
Lat             0
Lng             0
Distance(mi)    0
Crossing        0
Junction        0
Roundabout      0
Stop            0
Amenity         0
Side            0
Severity        0
Year            0
Month           0
Day             0
Hour            0
dtype: int64

In [10]:
dw['Weather_Condition'] = dw['Weather_Condition'].astype('category')
cat_columns = dw.select_dtypes(['category']).columns
print(cat_columns)
dw

Index(['Weather_Condition'], dtype='object')


Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,2020,27,7,18,Fair,64.0,0.00,64.0,70.0,20.0,10.0,No
1,2017,30,9,17,Partly Cloudy,,,71.1,57.0,9.2,10.0,No
2,2017,27,6,5,Overcast,,,57.9,87.0,15.0,9.0,No
3,2016,7,9,9,Clear,,,66.9,73.0,4.6,10.0,No
4,2019,19,10,2,Fair,52.0,0.00,52.0,89.0,0.0,9.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...
6896,2018,23,1,21,Clear,,,51.1,80.0,3.5,10.0,No
6897,2019,16,6,7,Cloudy,56.0,0.00,56.0,80.0,9.0,9.0,No
6898,2017,7,2,4,Rain,,0.07,61.0,90.0,32.2,7.0,No
6899,2016,22,4,16,Mostly Cloudy,,,61.0,67.0,21.9,10.0,No


In [11]:
dw[cat_columns] = dw[cat_columns].apply(lambda x: x.cat.codes)
dw

Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,2020,27,7,18,3,64.0,0.00,64.0,70.0,20.0,10.0,No
1,2017,30,9,17,17,,,71.1,57.0,9.2,10.0,No
2,2017,27,6,5,16,,,57.9,87.0,15.0,9.0,No
3,2016,7,9,9,0,,,66.9,73.0,4.6,10.0,No
4,2019,19,10,2,3,52.0,0.00,52.0,89.0,0.0,9.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...
6896,2018,23,1,21,0,,,51.1,80.0,3.5,10.0,No
6897,2019,16,6,7,1,56.0,0.00,56.0,80.0,9.0,9.0,No
6898,2017,7,2,4,20,,0.07,61.0,90.0,32.2,7.0,No
6899,2016,22,4,16,14,,,61.0,67.0,21.9,10.0,No


In [12]:
# del dw['Wind_Chill(F)']
# del dw['Precipitation(in)']
del dw['Selected']
dw

Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
0,2020,27,7,18,3,64.0,0.00,64.0,70.0,20.0,10.0
1,2017,30,9,17,17,,,71.1,57.0,9.2,10.0
2,2017,27,6,5,16,,,57.9,87.0,15.0,9.0
3,2016,7,9,9,0,,,66.9,73.0,4.6,10.0
4,2019,19,10,2,3,52.0,0.00,52.0,89.0,0.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...
6896,2018,23,1,21,0,,,51.1,80.0,3.5,10.0
6897,2019,16,6,7,1,56.0,0.00,56.0,80.0,9.0,9.0
6898,2017,7,2,4,20,,0.07,61.0,90.0,32.2,7.0
6899,2016,22,4,16,14,,,61.0,67.0,21.9,10.0


In [13]:
dw.fillna(dw.mean(),inplace=True)


In [14]:
dw.isnull().sum()

Year                 0
Day                  0
Month                0
Hour                 0
Weather_Condition    0
Wind_Chill(F)        0
Precipitation(in)    0
Temperature(F)       0
Humidity(%)          0
Wind_Speed(mph)      0
Visibility(mi)       0
dtype: int64

In [15]:
#dw.describe()
dw = dw.drop_duplicates(subset=['Year', 'Day','Month','Hour'], keep=False)
print(dw)


      Year  Day  Month  Hour  Weather_Condition  Wind_Chill(F)  \
0     2020   27      7    18                  3      64.000000   
1     2017   30      9    17                 17      59.762515   
2     2017   27      6     5                 16      59.762515   
3     2016    7      9     9                  0      59.762515   
8     2019   14      2    15                 14      59.762515   
...    ...  ...    ...   ...                ...            ...   
6896  2018   23      1    21                  0      59.762515   
6897  2019   16      6     7                  1      56.000000   
6898  2017    7      2     4                 20      59.762515   
6899  2016   22      4    16                 14      59.762515   
6900  2016   11     12     2                 23      59.762515   

      Precipitation(in)  Temperature(F)  Humidity(%)  Wind_Speed(mph)  \
0              0.000000            64.0         70.0             20.0   
1              0.006444            71.1         57.0         

In [16]:
import xml.etree.ElementTree as ET
import pandas as pd
#dataset_path = '/kaggle/input/car-crashes-severity-prediction/'
tree = ET.parse('/kaggle/input/car-crashes-severity-prediction/holidays.xml')
root = tree.getroot()

# Build one dict per holiday record: child tag -> child text
records = [{child.tag: child.text for child in record} for record in root]

holiday = pd.DataFrame(records)
holiday

Unnamed: 0,date,description
0,2012-01-02,New Year Day
1,2012-01-16,Martin Luther King Jr. Day
2,2012-02-20,Presidents Day (Washingtons Birthday)
3,2012-05-28,Memorial Day
4,2012-07-04,Independence Day
...,...,...
85,2020-09-07,Labor Day
86,2020-10-12,Columbus Day
87,2020-11-11,Veterans Day
88,2020-11-26,Thanksgiving Day
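
To make the XML-to-`DataFrame` step easier to follow, here is a minimal sketch on an inline XML string. The layout (a root element whose children each hold a `date` and a `description` element, with the hypothetical tag name `row`) is an assumption inferred from the table above, not a verified description of `holidays.xml`.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical snippet: the real holidays.xml layout is assumed, not verified.
xml_text = """
<root>
  <row><date>2012-01-02</date><description>New Year Day</description></row>
  <row><date>2012-07-04</date><description>Independence Day</description></row>
</root>
""".strip()

root = ET.fromstring(xml_text)

# One dict per record: child tag -> child text.
records = [{child.tag: child.text for child in record} for record in root]
holiday_toy = pd.DataFrame(records)

# Parsing the date makes year/month/day available through .dt later on.
holiday_toy["date"] = pd.to_datetime(holiday_toy["date"])
print(holiday_toy)
```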


In [17]:
#holiday[['Year','Mounth','Day']] = holiday.date.str.split("-",expand=True,)
holiday[['Year','Month','Day']] = holiday['date'].astype(str).str.split('-',expand=True)
holiday

Unnamed: 0,date,description,Year,Month,Day
0,2012-01-02,New Year Day,2012,01,02
1,2012-01-16,Martin Luther King Jr. Day,2012,01,16
2,2012-02-20,Presidents Day (Washingtons Birthday),2012,02,20
3,2012-05-28,Memorial Day,2012,05,28
4,2012-07-04,Independence Day,2012,07,04
...,...,...,...,...,...
85,2020-09-07,Labor Day,2020,09,07
86,2020-10-12,Columbus Day,2020,10,12
87,2020-11-11,Veterans Day,2020,11,11
88,2020-11-26,Thanksgiving Day,2020,11,26


In [18]:
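# numeric_only=True avoids averaging the string columns of `holiday`
holiday.fillna(holiday.mean(numeric_only=True),inplace=True)
# Note: this inspects `df`; use holiday.isnull().sum() to check the holiday table itself
df.isnull().sum()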

ID              0
Lat             0
Lng             0
Distance(mi)    0
Crossing        0
Junction        0
Roundabout      0
Stop            0
Amenity         0
Side            0
Severity        0
Year            0
Month           0
Day             0
Hour            0
dtype: int64

In [19]:
holiday = holiday.drop_duplicates(subset=['Year', 'Day','Month'], keep=False)
print(holiday)

          date                            description  Year Month Day
0   2012-01-02                           New Year Day  2012    01  02
1   2012-01-16             Martin Luther King Jr. Day  2012    01  16
2   2012-02-20  Presidents Day (Washingtons Birthday)  2012    02  20
3   2012-05-28                           Memorial Day  2012    05  28
4   2012-07-04                       Independence Day  2012    07  04
..         ...                                    ...   ...   ...  ..
85  2020-09-07                              Labor Day  2020    09  07
86  2020-10-12                           Columbus Day  2020    10  12
87  2020-11-11                           Veterans Day  2020    11  11
88  2020-11-26                       Thanksgiving Day  2020    11  26
89  2020-12-25                          Christmas Day  2020    12  25

[90 rows x 5 columns]


In [20]:
holiday['description'] = holiday['description'].astype('category')
cat_column = holiday.select_dtypes(['category']).columns
print(cat_column)
holiday[cat_column] = holiday[cat_column].apply(lambda x: x.cat.codes)

del holiday['date']

holiday

Index(['description'], dtype='object')


Unnamed: 0,description,Year,Month,Day
0,6,2012,01,02
1,4,2012,01,16
2,7,2012,02,20
3,5,2012,05,28
4,2,2012,07,04
...,...,...,...,...
85,3,2020,09,07
86,1,2020,10,12
87,9,2020,11,11
88,8,2020,11,26


In [21]:
df.select_dtypes(exclude=['int'])


Unnamed: 0,Lat,Lng,Distance(mi),Side,Year,Month,Day,Hour
0,37.762150,-122.405660,0.044,1,2016,03,25,15
1,37.719157,-122.448254,0.000,1,2020,05,05,19
2,37.808498,-122.366852,0.000,1,2016,09,16,19
3,37.785930,-122.391080,0.009,1,2020,03,29,19
4,37.719141,-122.448457,0.000,1,2019,10,09,08
...,...,...,...,...,...,...,...,...
6402,37.740630,-122.407930,0.368,1,2017,10,01,18
6403,37.752755,-122.402790,0.639,1,2018,10,23,07
6404,37.726304,-122.446015,0.000,1,2019,10,28,15
6405,37.808090,-122.367211,0.000,1,2019,05,04,13


In [22]:
df['Year']=df['Year'].astype('int')
df['Day']=df['Day'].astype('int')
df['Month']=df['Month'].astype('int')
df['Hour']=df['Hour'].astype('int')
result = pd.merge(df, dw, on=['Year', 'Day','Month','Hour'],how='left')
result

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,...,Month,Day,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
0,0,37.762150,-122.405660,0.044,0,0,0,0,1,1,...,3,25,15,,,,,,,
1,1,37.719157,-122.448254,0.000,0,0,0,0,0,1,...,5,5,19,,,,,,,
2,2,37.808498,-122.366852,0.000,0,0,0,1,0,1,...,9,16,19,0.0,59.762515,0.006444,62.1,80.0,9.2,10.0
3,3,37.785930,-122.391080,0.009,0,1,0,0,0,1,...,3,29,19,3.0,58.000000,0.000000,58.0,70.0,10.0,10.0
4,4,37.719141,-122.448457,0.000,0,0,0,0,0,1,...,10,9,8,3.0,58.000000,0.000000,58.0,65.0,3.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,0,0,0,0,0,1,...,10,1,18,22.0,59.762515,0.006444,61.0,62.0,17.3,10.0
6403,6403,37.752755,-122.402790,0.639,0,1,0,0,0,1,...,10,23,7,14.0,59.762515,0.006444,57.0,72.0,6.9,10.0
6404,6404,37.726304,-122.446015,0.000,0,1,0,0,0,1,...,10,28,15,3.0,71.000000,0.000000,71.0,16.0,9.0,10.0
6405,6405,37.808090,-122.367211,0.000,0,1,0,0,0,1,...,5,4,13,3.0,63.000000,0.000000,63.0,58.0,13.0,10.0


In [23]:
holiday['Year']=holiday['Year'].astype('int')
holiday['Day']=holiday['Day'].astype('int')
holiday['Month']=holiday['Month'].astype('int')
result1 = pd.merge(result, holiday, on=['Year', 'Day','Month'],how='left')
result1

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,...,Day,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.762150,-122.405660,0.044,0,0,0,0,1,1,...,25,15,,,,,,,,
1,1,37.719157,-122.448254,0.000,0,0,0,0,0,1,...,5,19,,,,,,,,
2,2,37.808498,-122.366852,0.000,0,0,0,1,0,1,...,16,19,0.0,59.762515,0.006444,62.1,80.0,9.2,10.0,
3,3,37.785930,-122.391080,0.009,0,1,0,0,0,1,...,29,19,3.0,58.000000,0.000000,58.0,70.0,10.0,10.0,
4,4,37.719141,-122.448457,0.000,0,0,0,0,0,1,...,9,8,3.0,58.000000,0.000000,58.0,65.0,3.0,10.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,0,0,0,0,0,1,...,1,18,22.0,59.762515,0.006444,61.0,62.0,17.3,10.0,
6403,6403,37.752755,-122.402790,0.639,0,1,0,0,0,1,...,23,7,14.0,59.762515,0.006444,57.0,72.0,6.9,10.0,
6404,6404,37.726304,-122.446015,0.000,0,1,0,0,0,1,...,28,15,3.0,71.000000,0.000000,71.0,16.0,9.0,10.0,
6405,6405,37.808090,-122.367211,0.000,0,1,0,0,0,1,...,4,13,3.0,63.000000,0.000000,63.0,58.0,13.0,10.0,


In [24]:
result1.isnull().sum()

ID                      0
Lat                     0
Lng                     0
Distance(mi)            0
Crossing                0
Junction                0
Roundabout              0
Stop                    0
Amenity                 0
Side                    0
Severity                0
Year                    0
Month                   0
Day                     0
Hour                    0
Weather_Condition    1780
Wind_Chill(F)        1780
Precipitation(in)    1780
Temperature(F)       1780
Humidity(%)          1780
Wind_Speed(mph)      1780
Visibility(mi)       1780
description          6259
dtype: int64
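
The 1780 missing weather values above are crash rows that found no weather record in the left merge. One possible contributor is `drop_duplicates(..., keep=False)`, which discards every row whose key occurs more than once. The sketch below uses made-up toy tables to compare `keep=False` with `keep='first'` and counts unmatched rows with the `indicator` option of `merge`; whether this explains the gap in the real data is an assumption to verify.

```python
import pandas as pd

crashes = pd.DataFrame({
    "Year": [2016, 2016, 2020],
    "Month": [3, 9, 5],
    "Day": [25, 16, 5],
    "Hour": [15, 19, 19],
})
# Made-up weather readings; the first two rows share the same hour.
weather = pd.DataFrame({
    "Year": [2016, 2016, 2016],
    "Month": [3, 3, 9],
    "Day": [25, 25, 16],
    "Hour": [15, 15, 19],
    "Temperature(F)": [64.0, 66.0, 62.1],
})
keys = ["Year", "Month", "Day", "Hour"]

candidates = {
    # keep=False drops BOTH readings for 2016-03-25 15h.
    "keep=False": weather.drop_duplicates(subset=keys, keep=False),
    # keep='first' retains one reading per hour instead.
    "keep='first'": weather.drop_duplicates(subset=keys, keep="first"),
}

unmatched_counts = {}
for name, table in candidates.items():
    merged = crashes.merge(table, on=keys, how="left", indicator=True)
    # "left_only" marks crash rows that found no weather record.
    unmatched_counts[name] = int((merged["_merge"] == "left_only").sum())

print(unmatched_counts)
```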

In [25]:
result1.fillna(result1.mean(),inplace=True)
result1.isnull().sum()
result1

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,...,Day,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.762150,-122.405660,0.044,0,0,0,0,1,1,...,25,15,11.326345,59.706894,0.006099,59.77316,68.547253,10.694157,9.507199,4.628378
1,1,37.719157,-122.448254,0.000,0,0,0,0,0,1,...,5,19,11.326345,59.706894,0.006099,59.77316,68.547253,10.694157,9.507199,4.628378
2,2,37.808498,-122.366852,0.000,0,0,0,1,0,1,...,16,19,0.000000,59.762515,0.006444,62.10000,80.000000,9.200000,10.000000,4.628378
3,3,37.785930,-122.391080,0.009,0,1,0,0,0,1,...,29,19,3.000000,58.000000,0.000000,58.00000,70.000000,10.000000,10.000000,4.628378
4,4,37.719141,-122.448457,0.000,0,0,0,0,0,1,...,9,8,3.000000,58.000000,0.000000,58.00000,65.000000,3.000000,10.000000,4.628378
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368,0,0,0,0,0,1,...,1,18,22.000000,59.762515,0.006444,61.00000,62.000000,17.300000,10.000000,4.628378
6403,6403,37.752755,-122.402790,0.639,0,1,0,0,0,1,...,23,7,14.000000,59.762515,0.006444,57.00000,72.000000,6.900000,10.000000,4.628378
6404,6404,37.726304,-122.446015,0.000,0,1,0,0,0,1,...,28,15,3.000000,71.000000,0.000000,71.00000,16.000000,9.000000,10.000000,4.628378
6405,6405,37.808090,-122.367211,0.000,0,1,0,0,0,1,...,4,13,3.000000,63.000000,0.000000,63.00000,58.000000,13.000000,10.000000,4.628378
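
In the table above, `description` is filled with the mean of the holiday category codes (4.628378) for every non-holiday row, which has no natural meaning. A hedged alternative, sketched on made-up toy rows, is to turn the merge result into a binary flag; the column name `is_holiday` is hypothetical.

```python
import pandas as pd

crashes = pd.DataFrame({
    "Year": [2020, 2020, 2019],
    "Month": [11, 5, 10],
    "Day": [26, 5, 9],
})
holidays = pd.DataFrame({
    "Year": [2020], "Month": [11], "Day": [26],
    "description": ["Thanksgiving Day"],
})

merged = crashes.merge(holidays, on=["Year", "Month", "Day"], how="left")

# A missing description means "no holiday that day", so encode that directly
# instead of replacing it with the mean of the category codes.
merged["is_holiday"] = merged["description"].notna().astype(int)
merged = merged.drop(columns="description")
print(merged)
```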


As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**

In [26]:
# This cell is used to select the numerical features. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
# X_train = X_train[['Lat', 'Lng', 'Distance(mi)']]


# X_train=result1[["ID","Lat","Lng","Distance(mi)","Crossing","Junction","Railway","Roundabout","Stop","Amenity","Year","Month","Day","Hour","Weather_Condition","Temperature(F)","Humidity(%)","Wind_Speed(mph)","Visibility(mi)","description"]]
# X_val = X_val[['Lat', 'Lng', 'Distance(mi)']]
# X_val=result1[["ID","Lat","Lng","Distance(mi)","Crossing","Junction","Railway","Roundabout","Stop","Amenity","Year","Month","Day","Hour","Weather_Condition","Temperature(F)","Humidity(%)","Wind_Speed(mph)","Visibility(mi)","description"]]
# X_train
# X_val

In [27]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(result1, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['ID', 'Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['ID', 'Severity'])
y_val = val_df['Severity']
print(X_train)

            Lat         Lng  Distance(mi)  Crossing  Junction  Roundabout  \
748   37.720890 -122.448044         0.000         1         0           0   
5720  37.727319 -122.402749         0.000         0         0           0   
1310  37.731370 -122.423590         0.161         0         0           0   
5343  37.731860 -122.418282         0.231         0         0           0   
1480  37.808498 -122.366852         0.000         0         0           0   
...         ...         ...           ...       ...       ...         ...   
3772  37.710819 -122.455711         0.000         0         0           0   
5191  37.761349 -122.392647         0.000         0         0           0   
5226  37.725182 -122.401639         0.000         0         1           0   
5390  37.769646 -122.417847         0.000         1         0           0   
860   37.778107 -122.401192         0.000         0         0           0   

      Stop  Amenity  Side  Year  ...  Day  Hour  Weather_Condition  \
748  

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [28]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)


Now let's test our classifier on the validation dataset and see the accuracy.

In [29]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.7441497659906396


Well, that's a good start, right? A classifier that predicts `Severity` as 2 for every example will get around 0.63. You should get a better score as you add more features and do better data preprocessing.
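
A majority-class baseline like the one mentioned above can be computed with scikit-learn's `DummyClassifier`. The labels below are made up to match the roughly 63% share of class 2; in the notebook the classifier would be fitted on `X_train, y_train` and scored on `X_val, y_val`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the imbalance described above.
y_toy = np.array([2] * 63 + [3] * 30 + [1] * 5 + [4] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print("Majority-class accuracy:", baseline_accuracy)
```

Any real model should be compared against this number rather than against zero.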

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [30]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58


Note that the test set has the same features but doesn't have the `Severity` column.
At this stage, one must **NOT** forget to apply to the test set the same processing that was applied to the training set.
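
One way to keep the two sets in sync is to put the processing into a single function and call it on both. The sketch below is a simplified illustration on made-up rows: the function name `preprocess`, the reduced column list, and the fixed `SIDE_CODES` mapping are all assumptions. A fixed mapping avoids a pitfall of calling `.cat.codes` separately on each set, where the same letter can receive different codes if one set lacks a category.

```python
import pandas as pd

SIDE_CODES = {"L": 0, "R": 1}  # fixed mapping (assumed values) shared by both sets
BOOL_COLUMNS = ["Crossing", "Junction"]  # subset chosen for the sketch

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply identical feature engineering to any crash table."""
    out = frame.copy()
    # Replace the raw timestamp with integer date parts.
    parsed = pd.to_datetime(out.pop("timestamp"))
    out["Year"] = parsed.dt.year
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    out["Hour"] = parsed.dt.hour
    # Booleans become 0/1, Side becomes a stable integer code.
    out[BOOL_COLUMNS] = out[BOOL_COLUMNS].astype(int)
    out["Side"] = out["Side"].map(SIDE_CODES)
    return out

train_toy = pd.DataFrame({
    "Crossing": [False, True], "Junction": [False, False],
    "Side": ["R", "L"],
    "timestamp": ["2016-03-25 15:13:02", "2019-10-09 08:47:00"],
})
test_toy = pd.DataFrame({
    "Crossing": [False], "Junction": [True],
    "Side": ["R"], "timestamp": ["2016-04-04 19:20:31"],
})

train_ready = preprocess(train_toy)
test_ready = preprocess(test_toy)
print(list(train_ready.columns) == list(test_ready.columns))
```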

Now we'll add `Severity` column to the test `DataFrame` and add the values of the predicted class to it.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [31]:
test_df[['Year','Month','Day']] = test_df['timestamp'].astype(str).str.split('-',expand=True)
test_df[['Day','Hour']] = test_df['Day'].str.split(' ',expand=True)
test_df[['Hour','x','y']] = test_df['Hour'].str.split(':',expand=True)
test_df


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Stop,Amenity,Side,timestamp,Year,Month,Day,Hour,x,y
0,6407,37.786060,-122.390900,False,0.039,False,False,True,False,False,...,False,False,R,2016-04-04 19:20:31,2016,04,04,19,20,31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,...,False,False,R,2020-10-28 11:51:00,2020,10,28,11,51,00
2,6409,37.807495,-122.476021,False,0.000,False,False,False,False,False,...,False,False,R,2019-09-09 07:36:45,2019,09,09,07,36,45
3,6410,37.761818,-122.405869,False,0.000,False,False,True,False,False,...,False,False,R,2019-08-06 15:46:25,2019,08,06,15,46,25
4,6411,37.732350,-122.414100,False,0.670,False,False,False,False,False,...,False,False,R,2018-10-17 09:54:58,2018,10,17,09,54,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1596,8003,37.812973,-122.362335,False,4.460,False,False,False,False,False,...,False,False,R,2020-06-26 22:32:22,2020,06,26,22,32,22
1597,8004,37.761818,-122.405861,False,0.010,False,False,True,False,False,...,False,False,R,2016-12-03 07:16:30,2016,12,03,07,16,30
1598,8005,37.732260,-122.431970,False,0.431,False,False,True,False,False,...,False,False,R,2017-02-20 06:32:44,2017,02,20,06,32,44
1599,8006,37.786782,-122.390126,False,0.000,True,False,False,False,False,...,False,False,R,2019-10-31 20:35:00,2019,10,31,20,35,00


In [32]:
del test_df['timestamp']
del test_df['x']
del test_df['y']
del test_df['Bump']
del test_df['Give_Way']
del test_df['No_Exit']
del test_df['Railway']

test_df


Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,Year,Month,Day,Hour
0,6407,37.786060,-122.390900,0.039,False,True,False,False,False,R,2016,04,04,19
1,6408,37.769609,-122.415057,0.202,False,False,False,False,False,R,2020,10,28,11
2,6409,37.807495,-122.476021,0.000,False,False,False,False,False,R,2019,09,09,07
3,6410,37.761818,-122.405869,0.000,False,True,False,False,False,R,2019,08,06,15
4,6411,37.732350,-122.414100,0.670,False,False,False,False,False,R,2018,10,17,09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1596,8003,37.812973,-122.362335,4.460,False,False,False,False,False,R,2020,06,26,22
1597,8004,37.761818,-122.405861,0.010,False,True,False,False,False,R,2016,12,03,07
1598,8005,37.732260,-122.431970,0.431,False,True,False,False,False,R,2017,02,20,06
1599,8006,37.786782,-122.390126,0.000,True,False,False,False,False,R,2019,10,31,20


In [33]:
test_df['Crossing']=test_df['Crossing'].astype(int)
test_df['Junction']=test_df['Junction'].astype(int)
# test_df['Railway']=test_df['Railway'].astype(int)
test_df['Roundabout']=test_df['Roundabout'].astype(int)
test_df['Stop']=test_df['Stop'].astype(int)
test_df['Amenity']=test_df['Amenity'].astype(int)
test_df


test_df['Side'] = test_df['Side'].astype('category')
cat_columns = test_df.select_dtypes(['category']).columns
print(cat_columns)
test_df[cat_columns] = test_df[cat_columns].apply(lambda x: x.cat.codes)
test_df



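# numeric_only=True skips the string columns (Year, Month, Day, Hour); recent pandas raises otherwise
test_df.fillna(test_df.mean(numeric_only=True),inplace=True)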
test_df.isnull().sum()

Index(['Side'], dtype='object')


ID              0
Lat             0
Lng             0
Distance(mi)    0
Crossing        0
Junction        0
Roundabout      0
Stop            0
Amenity         0
Side            0
Year            0
Month           0
Day             0
Hour            0
dtype: int64

In [34]:
test_df['Year']=test_df['Year'].astype('int')
test_df['Day']=test_df['Day'].astype('int')
test_df['Month']=test_df['Month'].astype('int')
test_df['Hour']=test_df['Hour'].astype('int')
result_test = pd.merge(test_df, dw, on=['Year', 'Day','Month','Hour'],how='left')
result_test


holiday['Year']=holiday['Year'].astype('int')
holiday['Day']=holiday['Day'].astype('int')
holiday['Month']=holiday['Month'].astype('int')
result_test2 = pd.merge(result_test, holiday, on=['Year', 'Day','Month'],how='left')
result_test2
result_test2.isnull().sum()
result_test2.fillna(result_test2.mean(),inplace=True)
result_test2.isnull().sum()


ID                   0
Lat                  0
Lng                  0
Distance(mi)         0
Crossing             0
Junction             0
Roundabout           0
Stop                 0
Amenity              0
Side                 0
Year                 0
Month                0
Day                  0
Hour                 0
Weather_Condition    0
Wind_Chill(F)        0
Precipitation(in)    0
Temperature(F)       0
Humidity(%)          0
Wind_Speed(mph)      0
Visibility(mi)       0
description          0
dtype: int64

In [35]:
X_test = result_test2.drop(columns=['ID'])

# You should update/remove the next line once you change the features used for training
# X_test = X_test[['Lat', 'Lng', 'Distance(mi)']]

y_test_predicted = classifier.predict(X_test)

result_test2['Severity'] = y_test_predicted

result_test2.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Roundabout,Stop,Amenity,Side,...,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description,Severity
0,6407,37.78606,-122.3909,0.039,0,1,0,0,0,1,...,19,17.0,59.762515,0.006444,63.0,60.0,10.4,10.0,4.702703,2
1,6408,37.769609,-122.415057,0.202,0,0,0,0,0,1,...,11,3.0,65.0,0.0,65.0,56.0,5.0,9.0,4.702703,2
2,6409,37.807495,-122.476021,0.0,0,0,0,0,0,1,...,7,11.27288,59.456778,0.006415,59.726448,69.255248,10.771298,9.458228,4.702703,2
3,6410,37.761818,-122.405869,0.0,0,1,0,0,0,1,...,15,3.0,72.0,0.0,72.0,59.0,17.0,10.0,4.702703,2
4,6411,37.73235,-122.4141,0.67,0,0,0,0,0,1,...,9,22.0,59.762515,0.006444,57.0,77.0,5.8,10.0,4.702703,2


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [36]:
result_test2[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)
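
Before submitting, it can help to read the file back and check its shape. The following is a minimal sketch with made-up predictions and a temporary folder instead of `/kaggle/working/`; the specific checks are suggestions, not official competition requirements.

```python
import os
import tempfile
import pandas as pd

# Stand-in predictions (made up); in the notebook these come from the classifier.
submission = pd.DataFrame({"ID": [6407, 6408, 6409], "Severity": [2, 2, 3]})

# Writing to a temporary folder here; on Kaggle the target is /kaggle/working/.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)

# Read the file back and check the properties a submission is expected to have.
check = pd.read_csv(out_path)
assert list(check.columns) == ["ID", "Severity"]
assert check["ID"].is_unique
assert check["Severity"].notna().all()
print("rows written:", len(check))
```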


The remaining steps are to submit the generated file, as follows.

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running, then go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps needed to get "slightly" familiar with the data and the submission process. We chose not to go into detail in each step, to keep the welcoming notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights, and use it to create your submission.

In [37]:
# df['Crossing']=df['Crossing'].astype(int)
# df['Give_Way']=df['Give_Way'].astype(int)
# df['Junction']=df['Junction'].astype(int)
# df['No_Exit']=df['No_Exit'].astype(int)
# df['Railway']=df['Railway'].astype(int)
# df['Roundabout']=df['Roundabout'].astype(int)
# df['Stop']=df['Stop'].astype(int)
# df['Amenity']=df['Amenity'].astype(int)
# print(df)
# import matplotlib.pyplot as plt
# df.groupby('Crossing').count().plot.bar()