## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy and learn as much as we did prepairing this competition.


## Introduction

In the competition, it's required to predict the `Severity` of a car crash given info about the crash, e.g., location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions! 

### Checklist
You can participate in this competition the way you perefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can submit up to 5 submissions per day. You can select only one of the submission you make to be considered in the final ranking.*


Don't hesitate to leave a comment or contact me if you have any question!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os

## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [2]:
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'

df1 = pd.read_csv(os.path.join(dataset_path, 'train.csv'))

print("The shape of the dataset is {}.\n\n".format(df1.shape))

df1.head()

The shape of the dataset is (6407, 16).




Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 featues, 1 ID, and the `Severity` of the crash.

By looking at the features and a sample from the data, the features look of numerical and catogerical types. What about some descriptive statistics?

In [3]:
#Discovring How many zero values in "Distance(mi)" column  
zero_dist_rows = df1[df1["Distance(mi)"] == 0]['ID'].count()
print("Distance(mi) ",zero_dist_rows)

#Discovring How many zero values in "Bump" column  
print("Bump ", df1[df1["Bump"] == False]['ID'].count())

#Discovring How many False values in "Crossing" column  
print("Crossing ",df1[df1["Crossing"] == False]['ID'].count())

#Discovring How many False values in "Give_Way" column  
print("Give_Way ",df1[df1["Give_Way"] == False]['ID'].count())

#Discovring How many False values in "Junction" column  
print("Junction ",df1[df1["Junction"] == False]['ID'].count())

#Discovring How many False values in "No_Exit" column  
print("No_Exit", df1[df1["No_Exit"] == False]['ID'].count())

#Discovring How many False values in "Railway" column 
print("Railway ",df1[df1["Railway"] == False]['ID'].count())

#Discovring How many False values in "Roundabout" column  
print("Roundabout ", df1[df1["Roundabout"] == False]['ID'].count())

#Discovring How many False values in "Stop" column  
print("Stop ",df1[df1["Stop"] == False]['ID'].count())

#Discovring How many False values in "Amenity" column  
print("Amenity", df1[df1["Amenity"] == False]['ID'].count())

print("conclusion from previous analysis we will drop Bump, No_Exist, Give_way, and Roundabout because they didn't give fair info almost all of it's rows values = False")

Distance(mi)  3923
Bump  6407
Crossing  5879
Give_Way  6404
Junction  4828
No_Exit 6406
Railway  6237
Roundabout  6407
Stop  5781
Amenity 6169
conclusion from previous analysis we will drop Bump, No_Exist, Give_way, and Roundabout because they didn't give fair info almost all of it's rows values = False


In [4]:
#Dropping  "Bump", "No_Exist", "Give_way", "id", and "Roundabout"  columns
df2 = df1.drop(columns = [ 'Bump', 'Give_Way', 'No_Exit', 'Roundabout'])

In [5]:
#Convert Railaway, crossing, stop, Junction, and Amentiy columns into zeros and ones 
df2['Crossing'] = df2['Crossing'] * 1
df2['Junction'] = df2['Junction'] * 1
df2['Railway'] = df2['Railway'] * 1
df2['Stop'] = df2['Stop'] * 1
df2['Amenity'] = df2['Amenity'] * 1

In [6]:
#Convert side column R to 1, L to 0
mapping_1 = {'R': 1, 'L': 0}
df2 = df2.replace({'Side': mapping_1})

In [7]:
#Replace Zeros values in Distance(mi) column into the mean of the distance
distance_mean = df2['Distance(mi)'].mean()
df2.loc[df2["Distance(mi)"] == 0, "Distance(mi)"] = distance_mean

In [8]:
print(type(df2['timestamp'][0]))

<class 'str'>


In [9]:
#change timestamp column type to Timestamp
df2["timestamp_date_type"] = pd.to_datetime(df2['timestamp'])
#split date, hour, day, month
df2["date"] = df2["timestamp_date_type"].apply(pd.Timestamp.date)
df2["hour"] = df2["timestamp_date_type"].dt.hour
df2["day"] = df2["timestamp_date_type"].dt.day
df2["month"] = df2["timestamp_date_type"].dt.month

In [10]:
#To merge weather data_set and training data_Set together make same column in both to merge on it 
# make modified_date column which contain  'year', 'month', 'day', 'hour'
# to match with weather data_set date_column
df2["year"] = df2['timestamp_date_type'].dt.year
df2['modified_date'] = pd.to_datetime(df2[['year', 'month', 'day', 'hour']])

#drop unwanted columns from df2
df3 = df2.copy()
df3 = df3.drop(columns = ['timestamp_date_type', 'date', 'hour', 'day', 'year'])

In [11]:
df3

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Railway,Stop,Amenity,Side,Severity,timestamp,month,modified_date
0,0,37.762150,-122.405660,0.044000,0,0,0,0,1,1,2,2016-03-25 15:13:02,3,2016-03-25 15:00:00
1,1,37.719157,-122.448254,0.135189,0,0,0,0,0,1,2,2020-05-05 19:23:00,5,2020-05-05 19:00:00
2,2,37.808498,-122.366852,0.135189,0,0,0,1,0,1,3,2016-09-16 19:57:16,9,2016-09-16 19:00:00
3,3,37.785930,-122.391080,0.009000,0,1,0,0,0,1,1,2020-03-29 19:48:43,3,2020-03-29 19:00:00
4,4,37.719141,-122.448457,0.135189,0,0,0,0,0,1,2,2019-10-09 08:47:00,10,2019-10-09 08:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,6402,37.740630,-122.407930,0.368000,0,0,0,0,0,1,3,2017-10-01 18:36:13,10,2017-10-01 18:00:00
6403,6403,37.752755,-122.402790,0.639000,0,1,0,0,0,1,2,2018-10-23 07:40:27,10,2018-10-23 07:00:00
6404,6404,37.726304,-122.446015,0.135189,0,1,0,0,0,1,2,2019-10-28 15:45:00,10,2019-10-28 15:00:00
6405,6405,37.808090,-122.367211,0.135189,0,1,0,0,0,1,3,2019-05-04 13:45:31,5,2019-05-04 13:00:00


In [12]:
#preprosessing on wether dataset
df_weather = pd.read_csv("/kaggle/input/car-crashes-severity-prediction/weather-sfcsv.csv")
print(df_weather.info())
df_weather_2 = df_weather.copy()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901 entries, 0 to 6900
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               6901 non-null   int64  
 1   Day                6901 non-null   int64  
 2   Month              6901 non-null   int64  
 3   Hour               6901 non-null   int64  
 4   Weather_Condition  6900 non-null   object 
 5   Wind_Chill(F)      3292 non-null   float64
 6   Precipitation(in)  3574 non-null   float64
 7   Temperature(F)     6899 non-null   float64
 8   Humidity(%)        6899 non-null   float64
 9   Wind_Speed(mph)    6556 non-null   float64
 10  Visibility(mi)     6900 non-null   float64
 11  Selected           6901 non-null   object 
dtypes: float64(6), int64(4), object(2)
memory usage: 647.1+ KB
None


In [13]:
display(df_weather_2)

Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,2020,27,7,18,Fair,64.0,0.00,64.0,70.0,20.0,10.0,No
1,2017,30,9,17,Partly Cloudy,,,71.1,57.0,9.2,10.0,No
2,2017,27,6,5,Overcast,,,57.9,87.0,15.0,9.0,No
3,2016,7,9,9,Clear,,,66.9,73.0,4.6,10.0,No
4,2019,19,10,2,Fair,52.0,0.00,52.0,89.0,0.0,9.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...
6896,2018,23,1,21,Clear,,,51.1,80.0,3.5,10.0,No
6897,2019,16,6,7,Cloudy,56.0,0.00,56.0,80.0,9.0,9.0,No
6898,2017,7,2,4,Rain,,0.07,61.0,90.0,32.2,7.0,No
6899,2016,22,4,16,Mostly Cloudy,,,61.0,67.0,21.9,10.0,No


In [14]:
# fill nan values in numerical columns with it's mean accordiing to it's month 
df_weather_2 = df_weather_2.fillna(df_weather_2.groupby('Month').transform('mean'))

In [15]:
df_weather_3 = df_weather_2.copy()
# Normalize 'Wind_Chill(F)', 'Precipitation(in)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)', 'Visibility(mi)' 

wind_chill_min = df_weather_3["Wind_Chill(F)"].min()
wind_chill_max = df_weather_3["Wind_Chill(F)"].max()
wind_chill_min_max = wind_chill_max - wind_chill_min
df_weather_3["Wind_Chill(F)"] = (df_weather_3["Wind_Chill(F)"] - wind_chill_min) / wind_chill_min_max

#Precip_min = df_weather_3["Precipitation(in)"].min()
#Precip_max = df_weather_3["Precipitation(in)"].max()
#Precip_min_max = Precip_max - Precip_min
#df_weather_3["Precipitation(in)"] = (df_weather_3["Precipitation(in)"] - Precip_min) / Precip_min_max

Temp_min = df_weather_3["Temperature(F)"].min()
Temp_max = df_weather_3["Temperature(F)"].max()
Temp_min_max = Temp_max - Temp_min
df_weather_3["Temperature(F)"] = (df_weather_3["Temperature(F)"] - Temp_min) / Temp_min_max

Humid_min = df_weather_3["Humidity(%)"].min()
Humid_max = df_weather_3["Humidity(%)"].max()
Humid_min_max = Humid_max - Humid_min
df_weather_3["Humidity(%)"] = (df_weather_3["Humidity(%)"] - Humid_min) / Humid_min_max

Wind_Speed_min = df_weather_3["Wind_Speed(mph)"].min()
Wind_Speed_max = df_weather_3["Wind_Speed(mph)"].max()
Wind_Speed_min_max = Wind_Speed_max - Wind_Speed_min
df_weather_3["Wind_Speed(mph)"] = (df_weather_3["Wind_Speed(mph)"] - Wind_Speed_min) / Wind_Speed_min_max

Visibility_min = df_weather_3["Visibility(mi)"].min()
Visibility_max = df_weather_3["Visibility(mi)"].max()
Visibility_min_max = Visibility_max - Visibility_min
df_weather_3["Visibility(mi)"] = (df_weather_3["Visibility(mi)"] - Visibility_min) / Visibility_min_max

In [16]:
df_weather_3.describe()

Unnamed: 0,Year,Day,Month,Hour,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
count,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0
mean,2018.293001,15.624837,6.77525,12.789886,0.433953,0.00593,0.384981,0.650788,0.266867,0.94419
std,1.390524,8.703753,3.567982,5.874155,0.117495,0.021118,0.128461,0.179754,0.155415,0.16404
min,2016.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2017.0,8.0,4.0,8.0,0.343502,0.0,0.289855,0.544444,0.148883,1.0
50%,2019.0,15.0,7.0,14.0,0.41704,0.0,0.37037,0.666667,0.248139,1.0
75%,2020.0,23.0,10.0,17.0,0.522532,0.006667,0.465378,0.777778,0.372208,1.0
max,2020.0,31.0,12.0,23.0,1.0,0.49,1.0,1.0,1.0,1.0


In [17]:
df_weather_4 = df_weather_3.copy()

In [18]:
#craeting Date column to merge weathre_dataset with traing_dataset
df_weather_4["date"] = pd.to_datetime(df_weather_4[['Year', 'Month', 'Day']])

In [19]:
df_weather_4['modified_date'] = pd.to_datetime(df_weather_4[['Year', 'Month', 'Day', 'Hour']])

In [20]:
df_weather_5 = df_weather_4.copy()
df_weather_5 = df_weather_5.drop(columns = ['Selected', 'Day', 'Month', 'Hour'])

In [21]:
# Factorizing Weather_Condition coulmn, then normalize it 
df_weather_5['Weather_Condition'] = pd.factorize(df_weather_5['Weather_Condition'])[0]

In [22]:
# Normalize Weather_Condition after Factorizing it 
min_weather = df_weather_5['Weather_Condition'].min()
max_weather = df_weather_5['Weather_Condition'].max()
min_max_weather = max_weather - min_weather
df_weather_5['Weather_Condition'] = (df_weather_5['Weather_Condition'] - min_weather) / min_max_weather

In [23]:
df_weather_5.describe()

Unnamed: 0,Year,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
count,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0,6901.0
mean,2018.293001,0.169502,0.433953,0.00593,0.384981,0.650788,0.266867,0.94419
std,1.390524,0.137718,0.117495,0.021118,0.128461,0.179754,0.155415,0.16404
min,2016.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2017.0,0.076923,0.343502,0.0,0.289855,0.544444,0.148883,1.0
50%,2019.0,0.153846,0.41704,0.0,0.37037,0.666667,0.248139,1.0
75%,2020.0,0.192308,0.522532,0.006667,0.465378,0.777778,0.372208,1.0
max,2020.0,1.0,1.0,0.49,1.0,1.0,1.0,1.0


In [24]:
#Merginig weather data set with trainig data_set to get the whoole dataset
#Merging on modified_date in weather and modified_date in traing dataset
#whole_df = df3.merge(df_weather_5, on = 'modified_date', how='left')
whole_df = pd.merge(df3, df_weather_5, how='left', left_on = "modified_date", right_on = 'modified_date')
print(whole_df.shape)
whole_df_2 = whole_df.drop_duplicates("ID")
print(whole_df_2.shape)

(8537, 23)
(6407, 23)


In [25]:
whole_df_2.columns

Index(['ID', 'Lat', 'Lng', 'Distance(mi)', 'Crossing', 'Junction', 'Railway',
       'Stop', 'Amenity', 'Side', 'Severity', 'timestamp', 'month',
       'modified_date', 'Year', 'Weather_Condition', 'Wind_Chill(F)',
       'Precipitation(in)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)', 'date'],
      dtype='object')

In [26]:
# Extracting holiday column from holiday dataset 
from xml.dom import minidom
from datetime import datetime 

holidays_file = minidom.parse("/kaggle/input/car-crashes-severity-prediction/holidays.xml")
dates = holidays_file.getElementsByTagName('date')

month_day_list = []
month_list = []
day_list = []

for d in dates:
    month_day_list.append(d.firstChild.data[5:])
    month_list.append(int(d.firstChild.data[5:7]))
    day_list.append(int(d.firstChild.data[8:]))
    
whole_df_2["month_day"] = whole_df_2["timestamp"].apply(lambda x: x[5:10])
holiday_list = []

# creating holiday column
for md in whole_df_2["month_day"]:
    if md in month_day_list:
        holiday_list.append(1)
    else:
        holiday_list.append(0)
        
whole_df_2["holiday"] = holiday_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [27]:
print(whole_df_2.shape)
print(whole_df_2.columns)

(6407, 25)
Index(['ID', 'Lat', 'Lng', 'Distance(mi)', 'Crossing', 'Junction', 'Railway',
       'Stop', 'Amenity', 'Side', 'Severity', 'timestamp', 'month',
       'modified_date', 'Year', 'Weather_Condition', 'Wind_Chill(F)',
       'Precipitation(in)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)', 'date', 'month_day', 'holiday'],
      dtype='object')


In [28]:
whole_df_2['month'].unique()

array([ 3,  5,  9, 10,  2,  6,  4,  8, 11, 12,  1,  7])

In [29]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(whole_df_2, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['Severity'])
y_val = val_df['Severity']


In [30]:
whole_df_2.columns

Index(['ID', 'Lat', 'Lng', 'Distance(mi)', 'Crossing', 'Junction', 'Railway',
       'Stop', 'Amenity', 'Side', 'Severity', 'timestamp', 'month',
       'modified_date', 'Year', 'Weather_Condition', 'Wind_Chill(F)',
       'Precipitation(in)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)', 'date', 'month_day', 'holiday'],
      dtype='object')

As pointed out eariler, I'll use the numerical features to train the classifier. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.** 

In [31]:
# This cell is used to select the numerical features. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
X_train = X_train[['Lat', 'Lng', 'Distance(mi)', 'Junction', 'Railway', 'Stop', 'Amenity', 'holiday','Side', 'Weather_Condition', 'Wind_Chill(F)', 'Precipitation(in)',
       'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)', 'Visibility(mi)','Year']]

X_val = X_val[['Lat', 'Lng', 'Distance(mi)', 'Junction', 'Railway', 'Stop', 'Amenity', 'holiday','Side', 'Weather_Condition', 'Wind_Chill(F)', 'Precipitation(in)',
       'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)', 'Visibility(mi)','Year']]


## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [32]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [33]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.7480499219968799


Well. That's a good start, right? A classifier that predicts all examples' `Severity` as 2 will get around 0.63. You should get better score as you add more features and do better data preprocessing.

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [34]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58


In [35]:
#Dropping  "Bump", "No_Exist", "Give_way", "id", and "Roundabout"  columns
test_df_2 = test_df.drop(columns = [ 'Bump', 'Give_Way', 'No_Exit', 'Roundabout'])

In [36]:
#Convert Railaway, crossing, stop, Junction, and Amentiy columns into zeros and ones 
test_df_2['Crossing'] = test_df_2['Crossing'] * 1
test_df_2['Junction'] = test_df_2['Junction'] * 1
test_df_2['Railway'] = test_df_2['Railway'] * 1
test_df_2['Stop'] = test_df_2['Stop'] * 1
test_df_2['Amenity'] = test_df_2['Amenity'] * 1

In [37]:
#Convert side column R to 1, L to 0
mapping_1 = {'R': 1, 'L': 0}
test_df_2 = test_df_2.replace({'Side': mapping_1})

In [38]:
#Replace Zeros values in Distance(mi) column into the mean of the distance
distance_mean = test_df_2['Distance(mi)'].mean()
test_df_2.loc[test_df_2["Distance(mi)"] == 0, "Distance(mi)"] = distance_mean

In [39]:
#change timestamp column type to Timestamp
test_df_2["timestamp_date_type"] = pd.to_datetime(df2['timestamp'])

In [40]:
#split date, hour, day, month
test_df_2["date"] = test_df_2["timestamp_date_type"].apply(pd.Timestamp.date)
test_df_2["hour"] = test_df_2["timestamp_date_type"].dt.hour
test_df_2["day"] = test_df_2["timestamp_date_type"].dt.day
test_df_2["month"] = test_df_2["timestamp_date_type"].dt.month

In [41]:
#To merge weather data_set and training data_Set together make same column in both to merge on it 
# make modified_date column which contain  'year', 'month', 'day', 'hour'
# to match with weather data_set date_column
test_df_2["year"] = test_df_2['timestamp_date_type'].dt.year
test_df_2['modified_date'] = pd.to_datetime(test_df_2[['year', 'month', 'day', 'hour']])

In [42]:
#drop unwanted columns from df2
test_df_3 = test_df_2.copy()
test_df_3 = test_df_2.drop(columns = ['timestamp_date_type', 'date', 'hour', 'day', 'year'])


In [43]:
#Merginig weather data set with trainig data_set to get the whoole dataset
#Merging on modified_date in weather and modified_date in traing dataset
#whole_df = df3.merge(df_weather_5, on = 'modified_date', how='left')
whole_test_df = pd.merge(test_df_3, df_weather_5, how='left', left_on = "modified_date", right_on = 'modified_date')
print(whole_test_df.shape)

(2137, 22)


In [44]:
whole_test_df_2 = whole_test_df.drop_duplicates("ID")
print(whole_test_df_2.shape)

(1601, 22)


In [45]:
# creating holiday column
whole_test_df_2["month_day"] = whole_df_2["timestamp"].apply(lambda x: x[5:10])

holiday_test_list = []
for md in whole_test_df_2["month_day"]:
    if md in month_day_list:
        holiday_test_list.append(1)
    else:
        holiday_test_list.append(0)
        
whole_test_df_2["holiday"] = holiday_test_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


In [46]:
whole_test_df_2

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Railway,Stop,Amenity,Side,...,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),date,month_day,holiday
0,6407,37.786060,-122.390900,0.039000,0,1,0,0,0,1,...,0.307692,0.352858,0.0122,0.450886,0.533333,0.570720,1.0,2016-03-25,03-25,0
2,6408,37.769609,-122.415057,0.202000,0,0,0,0,0,1,...,0.576923,0.387145,0.0000,0.338164,0.811111,0.545906,1.0,2020-05-05,05-05,0
4,6409,37.807495,-122.476021,0.149761,0,0,0,0,0,1,...,0.153846,0.569058,0.0000,0.420290,0.777778,0.228288,1.0,2016-09-16,09-16,0
5,6410,37.761818,-122.405869,0.149761,0,1,0,0,0,1,...,0.038462,0.402093,0.0000,0.354267,0.666667,0.248139,1.0,2020-03-29,03-29,0
6,6411,37.732350,-122.414100,0.670000,0,0,0,0,0,1,...,0.038462,0.402093,0.0000,0.354267,0.611111,0.074442,1.0,2019-10-09,10-09,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2130,8003,37.812973,-122.362335,4.460000,0,0,0,0,0,1,...,0.153846,0.569058,0.0000,0.402576,0.722222,0.228288,1.0,2018-09-11,09-11,0
2131,8004,37.761818,-122.405861,0.010000,0,1,0,0,0,1,...,0.076923,0.581465,0.0000,0.547504,0.566667,0.223325,1.0,2019-08-19,08-19,0
2132,8005,37.732260,-122.431970,0.431000,0,1,0,0,0,1,...,0.230769,0.431988,0.0000,0.386473,0.688889,0.074442,0.9,2020-02-28,02-28,0
2133,8006,37.786782,-122.390126,0.149761,1,0,0,0,0,1,...,0.038462,0.192825,0.0000,0.128824,0.833333,0.074442,1.0,2020-12-29,12-29,0


Note that the test set has the same features and doesn't have the `Severity` column.
At this stage one must **NOT** forget to apply the same processing done on the training set on the features of the test set.

Now we'll add `Severity` column to the test `DataFrame` and add the values of the predicted class to it.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [47]:
X_test = whole_test_df_2.drop(columns=['ID'])

# You should update/remove the next line once you change the features used for training
X_test = X_test[['Lat', 'Lng', 'Distance(mi)', 'Junction', 'Railway', 'Stop', 'Amenity', 'holiday','Side', 'Weather_Condition', 'Wind_Chill(F)', 'Precipitation(in)',
       'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)', 'Visibility(mi)','Year']]

y_test_predicted = classifier.predict(X_test)

whole_test_df_2['Severity'] = y_test_predicted

whole_test_df_2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Junction,Railway,Stop,Amenity,Side,...,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),date,month_day,holiday,Severity
0,6407,37.78606,-122.3909,0.039,0,1,0,0,0,1,...,0.352858,0.0122,0.450886,0.533333,0.57072,1.0,2016-03-25,03-25,0,2
2,6408,37.769609,-122.415057,0.202,0,0,0,0,0,1,...,0.387145,0.0,0.338164,0.811111,0.545906,1.0,2020-05-05,05-05,0,2
4,6409,37.807495,-122.476021,0.149761,0,0,0,0,0,1,...,0.569058,0.0,0.42029,0.777778,0.228288,1.0,2016-09-16,09-16,0,2
5,6410,37.761818,-122.405869,0.149761,0,1,0,0,0,1,...,0.402093,0.0,0.354267,0.666667,0.248139,1.0,2020-03-29,03-29,0,2
6,6411,37.73235,-122.4141,0.67,0,0,0,0,0,1,...,0.402093,0.0,0.354267,0.611111,0.074442,1.0,2019-10-09,10-09,1,2


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [48]:
whole_test_df_2[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)


The remaining steps is to submit the generated file and are as follows. 

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running the go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps that one should do in order to get "slightly" familiar with the data and the submission process. We chose not to go into details in each step to keep the welcoming notebook simple and make a room for improvement.

You're encourged to `Fork` the notebook, edit it, add your insights and use it to create your submission.