In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import sklearn

# Instructions
1. We will be conducting the entire assignment through this notebook. You will be entering your code in the cells provided, and any explanation and details asked in markdown cells. 
2. You are free to add more code and markdown cells for describing your answer, but make sure they are below the question asked and not somewhere else. 
3. The notebook needs to be submitted on LMS. You can find the submission link [here](https://lms.iiitb.ac.in/moodle/mod/assign/view.php?id=13932). 
4. The deadline for submission is **5th October, 2020 11:59PM**.

# Data import
The data required for this assignment can be downloaded from the following [link](https://www.kaggle.com/dataset/e7cff1a2c6e29e18684fe6b077d3e4c42f9a7ae6199e01463378c60fe4b4c0cc); it is hosted on Kaggle. Do check the directory paths on your local system.

In [None]:
alcdata = pd.read_csv("../input/iiitb-ai511ml2020-assignment-1/Assignment/alcoholism/student-mat.csv")
fifadata = pd.read_csv("../input/iiitb-ai511ml2020-assignment-1/Assignment/fifa18/data.csv")
accidata1 = pd.read_csv("../input/iiitb-ai511ml2020-assignment-1/Assignment/accidents/accidents_2005_to_2007.csv")
accidata2 = pd.read_csv("../input/iiitb-ai511ml2020-assignment-1/Assignment/accidents/accidents_2009_to_2011.csv")
accidata3 = pd.read_csv("../input/iiitb-ai511ml2020-assignment-1/Assignment/accidents/accidents_2012_to_2014.csv")

# Part - 1
## Alcohol Consumption Data
The following data was obtained in a survey of students' math course in secondary school. It contains a lot of interesting social, gender and study information about students. 


### 1. Try to visualize correlations between various features and grades and see which features have a significant impact on grades. 
Try to engineer the three grade parameters (G1, G2 and G3) as one feature for such comparisons.



In [None]:
alcdata.info()

In [None]:
alcdata.isnull().sum().sum()

There are no null entries in the data

In [None]:
alcdata.head()

***The grades as a whole convey the same information as the three grades separately, hence G1, G2, G3 can be dropped and a new feature Average_G i.e. the mean of G1, G2, G3 is introduced***

In [None]:
alcdata["Average_G"] = alcdata[["G1","G2","G3"]].mean(axis = 1)
alcdata.drop(["G1","G2","G3"],axis = 1, inplace = True)

In [None]:
#cat_alcdata is alcdata without encoding categorical features
cat_alcdata = alcdata.copy()

alcdata.head()

In [None]:
female_alcdata = alcdata.groupby("sex").get_group("F")
male_alcdata = alcdata.groupby("sex").get_group("M")

# Let us compare how different features impact the grades.

Histograms and boxplots are used for this.

In [None]:
sns.distplot(female_alcdata.Average_G,bins = 20)
sns.distplot(male_alcdata.Average_G,bins = 20)

The orange histogram corresponds to male students, and the blue one corresponds to female students. 
We see that the two histograms almost overlap, with only slight variations.

# Let's explore more using boxplots!
We are choosing boxplots because they make it easier to analyse the range and to compare the medians, minima and maxima.

In [None]:
sns.boxplot(x='sex',y='Average_G',data=alcdata)

**From this box plot, we can infer that,**
* The median grade (the line inside each box) for male students is slightly higher than for female students. 
* The minimum and maximum grades obtained by male students are higher than those of female students.
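
The values read off the box plot can be cross-checked numerically. Here is a minimal sketch, using a small made-up frame (`toy`) in place of `alcdata`; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for alcdata
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "Average_G": [8.0, 10.0, 12.0, 9.0, 11.0, 16.0],
})

# Per-group summary: the quantities a box plot draws, plus the mean for comparison
summary = toy.groupby("sex")["Average_G"].agg(["min", "median", "mean", "max"])
print(summary)
```

Comparing the `median` and `mean` columns side by side makes it clear which statistic a given claim is about.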

In [None]:
sns.boxplot(x='famsize',y='Average_G',data = alcdata)

We can infer that,
* Students with a family size of 3 or fewer (LE3) have a narrower range of grades.
* The maximum grade is obtained by a student whose family size is greater than 3 (GT3).
* The median grade of students with a family size of 3 or fewer is slightly higher than that of the other group.

Therefore family size might have a small contribution to grades.

In [None]:
sns.boxplot(x='school',y='Average_G',data = alcdata)

We can see that,
* Students of MS school have a better minimum grade. The worst grade at MS is still better than the grades of some students from GP.
* A student from GP has the best grade.
* While the range might be different, the median is almost the same for both schools.

### 2. If there is a need for encoding some of the features,  how would you go  about it? 
Would you consider combining certain encodings together ?


We look at the number of unique values of each feature.
Categorical features return their number of unique values, while numerical features return NaN.

In [None]:
alcdata.describe(include='all').loc['unique', :]

We can see that in most of the above categorical features, no. of unique values is 2. Hence we can use **One-hot encoding for these**

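**Mjob**, **Fjob**, **reason** and **guardian** are categorical features with more than two values that also need to be encoded.
Two options that we can consider:
* Label encoding: efficient, since each feature stays a single column.
* One-hot encoding: appropriate when we do not want any order.

Since these categories have no natural order, we do not want the encoding to give priority to a particular job, reason or guardian.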
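
To make the difference between the two options concrete, here is a small sketch on a made-up column (the values are illustrative, not taken from the data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column
jobs = pd.Series(["teacher", "health", "other", "teacher"], name="Mjob")

# Label encoding: one integer per category, assigned in sorted order,
# which implies an ordering (health < other < teacher) with no real meaning
label_codes = LabelEncoder().fit_transform(jobs)

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(jobs, prefix="Mjob")

print(list(label_codes))
print(list(one_hot.columns))
```

Passing `prefix` keeps the indicator columns traceable to their source feature, which matters when several features share category names.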

# Let's go with one-hot encoding using get_dummies() function!

In [None]:
cat_attributes = alcdata.select_dtypes(include = ['object'])

for col in cat_attributes:    
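    # Prefix each dummy with its source column so that features sharing category
    # names (e.g. Mjob/Fjob, or the many yes/no columns) do not produce duplicate column names
    dumm = pd.get_dummies(alcdata[col], prefix = col)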
    alcdata = pd.concat([alcdata, dumm], axis = 1)
    alcdata.drop(columns=[col], inplace = True)


### 3. Try to find out how family relation(famrel) and parents cohabitation(Pstatus) affect grades of students. 


### Categorical scatter plots will help us capture the relationships of famrel and cohabitation status with grades

In [None]:
print(alcdata["famrel"].unique())
sns.catplot(x = 'famrel', y = 'Average_G', data = alcdata)

Students with a family relationship quality (famrel) of 4 include both the highest and the lowest grade.
The density of students with better grades is higher for better famrel values.
As the distributions are almost the same across famrel values, grades do not appear to depend much on famrel.

In [None]:
sns.catplot(x = 'Pstatus', y = 'Average_G', data = cat_alcdata)
#One point lies far below the rest. It may be an outlier

We can clearly note that there are a lot of students with good grades whose parents live together.
Grades of students whose parents live apart (Pstatus = A) are concentrated around 10, while grades of students with Pstatus = T are concentrated around 13.

In [None]:
sns.catplot(x = 'famrel', y = 'Average_G', hue = "Pstatus", kind = "box", data = cat_alcdata, aspect = 2)

* From this we can see that students with a very good family relationship have the highest median grade even when their parents don't live together.
* The highest grade belongs to a student with famrel = 4 whose parents live together.
* Surprisingly, the lowest grade also belongs to a student with famrel = 4, but whose parents live apart.
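
The medians behind these observations can also be tabulated directly. A minimal sketch on made-up rows (`toy` is a stand-in for `cat_alcdata`) follows; the group sizes are shown as well, because a median based on very few students is not very reliable.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only
toy = pd.DataFrame({
    "famrel": [4, 4, 5, 5, 4, 5],
    "Pstatus": ["T", "A", "T", "A", "T", "A"],
    "Average_G": [12.0, 9.0, 11.0, 14.0, 10.0, 13.0],
})

# Median grade for every famrel / Pstatus combination
medians = toy.pivot_table(index="famrel", columns="Pstatus",
                          values="Average_G", aggfunc="median")
# Number of students behind each median
counts = toy.pivot_table(index="famrel", columns="Pstatus",
                         values="Average_G", aggfunc="count")
print(medians)
print(counts)
```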


### 4. Figure out which features in the data are skewed, and propose a way to remove skew from all such columns. 

Only numerical continuous attributes can be used to check for skew. Hence we have to filter out the numerical attributes

In [None]:
numerical_attributes = cat_alcdata.select_dtypes(include = ['int', 'float'])

### Age, Average_G and absences are the only continuous features. Of these, Average_G conveys the information we are studying, so transforming it would corrupt our data. 

In [None]:
numerical_attributes[["absences", "Average_G"]].hist(figsize = (12,5), bins = 50)

We can see that absences is skewed. To remove the skew we apply a log transform. But before measuring the skew, let's check whether there are any outliers.

In [None]:
sns.catplot(x = 'absences', data = numerical_attributes[["absences"]], aspect = 2, kind ='box')

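As there are a lot of outliers, min-max normalisation is a poor fit (a single extreme value would compress all the other values into a narrow band), so we use the z-transform for scaling instead.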
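
The effect of an outlier on the two scalings can be seen on a handful of made-up values. This is only a sketch; note that neither scaling removes the outlier, they just differ in how the remaining values are spread.

```python
import numpy as np

# Hypothetical absence counts; 75 plays the role of an outlier
values = np.array([0.0, 2.0, 4.0, 6.0, 75.0])

# Min-max scaling maps the range to [0, 1], so the outlier pins the top of the scale
min_max = (values - values.min()) / (values.max() - values.min())

# z-transform: subtract the mean and divide by the standard deviation
z_score = (values - values.mean()) / values.std()

print(min_max.round(3))
print(z_score.round(3))
```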

In [None]:
# numerical_attributes[["absences"]] = numerical_attributes[["absences"]].replace({0: None}).dropna()
# # transformed_absences.replace(0, "nan").dropna(axis=1,how="all")
# print(numerical_attributes[["absences"]])

def ztransform(x):
    # Standardise to zero mean and unit variance (a linear rescaling, so the skew is unchanged)
    return (x - x.mean())/x.std()

absences = numerical_attributes["absences"]
# z-scores are negative below the mean, so taking their log gives NaN;
# log1p is applied to the raw, non-negative counts instead (log1p(0) == 0)
log_absences = np.log1p(absences)

sns.distplot(ztransform(log_absences), bins = 30)

print("Skew before: " + str(absences.skew()))
print("Skew after: " + str(log_absences.skew()))

### We were successful at reducing the skew!
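
As a sanity check of the idea, the sketch below uses synthetic right-skewed counts (not the survey data) to show two things: standardising alone leaves the skew untouched, while a `log1p` transform reduces it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed, non-negative counts
absences = pd.Series(rng.exponential(scale=6.0, size=500)).round()

skew_raw = absences.skew()
# A z-transform is linear, so it cannot change the skew
skew_z = ((absences - absences.mean()) / absences.std()).skew()
# log1p compresses the long right tail and is defined at zero
skew_log = np.log1p(absences).skew()

print(skew_raw, skew_z, skew_log)
```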

# Part - 2
## FIFA 2019  Data


### 1. Which clubs are the most economical? How did you decide that?

In [None]:
#enter code/answer in this cell. You can add more code/markdown cells below for your answer. 
fifadata.info()
fifadata.head()

Wage, Value and Release Clause are money amounts stored as strings and hence have to be converted into float.
### The following function preprocesses the money strings and converts them into float. 
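
As a rough illustration of the conversion, here is a vectorised sketch on a few made-up strings. `money_to_float` is a hypothetical helper, not the function used in this notebook, and it assumes the only suffixes are `M` and `K`.

```python
import pandas as pd

def money_to_float(series):
    """Hypothetical helper: convert strings like '€110.5M' or '€565K' to floats."""
    text = series.str.replace("€", "", regex=False)
    # Map the trailing unit letter to a multiplier; anything else counts as 1
    multiplier = text.str[-1].map({"M": 1_000_000, "K": 1_000}).fillna(1)
    number = pd.to_numeric(text.str.rstrip("MK"), errors="coerce")
    return number * multiplier

raw = pd.Series(["€110.5M", "€565K", "€0", None])  # made-up examples
converted = money_to_float(raw)
print(converted.tolist())
```

Missing entries stay missing, so they can still be counted or filled afterwards.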

In [None]:
def PreProcess(i):
    # Convert a money string such as '€110.5M' or '€565K' to a float
    if(isinstance(i,str)):
        i = i.replace('€','')
        if(i.endswith('M')):
            return(float(i[:-1])*1000000)
        elif(i.endswith('K')):
            return(float(i[:-1])*1000)
        else:
            return(float(i))
    # Missing values (NaN) and values that are already numeric are passed through unchanged
    return i

for col in ['Wage', 'Value', 'Release Clause']:
    fifadata[col] = fifadata[col].apply(PreProcess)

### Value is an asset, Wage is an expense. I have used (Value - Wage) to determine which club is the most economical! 

In [None]:
#We have summed up value and wage of all players of a particular club and sorted to get the most economical club

club_Wage = fifadata['Wage'].groupby(fifadata['Club']).sum()
club_Value = fifadata['Value'].groupby(fifadata['Club']).sum()

In [None]:
(club_Value - club_Wage).sort_values(ascending = False)

### From the above result, we can conclude that REAL MADRID is the most economical. 

### 2. What is the relationship between age and individual potential of the player? How does age influence the players' value? At what age does the player exhibit peak pace ?

We look at the null values first and fill them with the mean of the data column if required

In [None]:
print(fifadata["Potential"].isnull().sum())
print(fifadata["Value"].isnull().sum())
print(fifadata["SprintSpeed"].isnull().sum())

In [None]:
fifadata['SprintSpeed'] = fifadata['SprintSpeed'].fillna(fifadata['SprintSpeed'].mean())

In [None]:
# fig, axes = plt.subplots(figsize=(7,5))
# axes.set_ylabel('Potential')
# fifadata[["Age", "Potential"]].plot(x = 'Age',ax = axes, y = 'Potential', kind = 'scatter')

sns.lmplot(x = 'Age',y = 'Potential', order = 2, data = fifadata, aspect = 1.5)
sns.lmplot(x = 'Age',y = 'Value', order = 2, data = fifadata, aspect = 1.5)
sns.lmplot(x = 'Age',y = 'SprintSpeed', order = 2, data = fifadata, aspect = 1.5)

* Potential and Age have an inverse, non-linear relationship. 
As the age increases, potential decreases.

* Value is spread almost uniformly across ages.
This suggests that the value a player brings to the team doesn't necessarily depend on age, but it does depend on potential.
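
One way to read off the age of peak pace is to average `SprintSpeed` per age and take the age with the highest mean. A minimal sketch on made-up numbers (`toy` stands in for `fifadata`; the result below says nothing about the real data):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [18, 18, 24, 24, 30, 30, 36],
    "SprintSpeed": [70, 72, 80, 78, 74, 70, 60],
})

# Mean sprint speed at each age, then the age where that mean is highest
mean_speed_by_age = toy.groupby("Age")["SprintSpeed"].mean()
peak_age = mean_speed_by_age.idxmax()
print(peak_age)
```

On real data, ages with only a handful of players can produce noisy means, so it may be worth checking the group sizes too.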

### 3. What skill sets are helpful in deciding a player's potential? How do the traits contribute to the players' potential? 

In [None]:
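# Compute the correlations once, on numeric columns only, instead of on every loop iteration
potential_corr = fifadata.select_dtypes(include = 'number').corr()['Potential']

for col in fifadata.columns[54:87]:
    if(col in potential_corr.index and np.abs(potential_corr[col]) > 0.4):
        print( str(col) + " is related to " + 'Potential and hence might be helpful in deciding Potential' )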
    

# plt.figure(figsize=(10,10))
# sns.heatmap(fifadata[54:87].corr(), vmin=-1, cmap="coolwarm", annot=True)

### 4. Which features directly contribute to the wages of the players?

We analyse the features by a scatterplot of wages vs some features.
According to my intuition, Sprint speed, overall and potential must affect the wage. 

In [None]:
# sns.lmplot(x = 'Wage',y = 'SprintSpeed', order = 2, data = fifadata)
fifadata.plot(kind = 'scatter',y='Wage',x='SprintSpeed',figsize=(10,10))
fifadata.plot(kind = 'scatter',y='Wage',x='Overall',figsize=(10,10))

* SprintSpeed should intuitively affect the wage, but the distribution is almost uniform.
* The wage, however, increases roughly exponentially with Overall and Potential, as expected.

In [None]:
fifadata.plot(kind = 'scatter',y='Wage',x='Potential',figsize=(10,10))

### 5. What is the age distribution in different clubs? Which club has most players young?

### Let's look at the min, max, median and mean age of all the clubs to analyse the age distribution.
Considering players aged 20 or under as young, we can sort clubs by their number of young players.

In [None]:
age_dist = fifadata["Age"].groupby(fifadata["Club"])
list_values = ["min", "max", "median", "mean"]
print(age_dist.agg(list_values))

club_age = fifadata['Age'].groupby(fifadata['Club']).apply(lambda x : (x<=20).sum())
club_age.sort_values(ascending = False)

## FC Nordsjælland has the most no. of young players!

# Part - 3
## UK Road Accidents Data


The UK government amassed traffic data between 2000 and 2016, recording over 1.6 million accidents in the process and making this one of the most comprehensive traffic data sets out there. It's a huge picture of a country undergoing change.

### 1. The very first step should be to merge all the 3 subsets of the data.

In [None]:
print(accidata1.shape)
print(accidata2.shape)
print(accidata3.shape)

In [None]:
#enter code/answer in this cell. You can add more code/markdown cells below for your answer. 
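# ignore_index gives the merged frame a fresh 0..n-1 index instead of repeating each subset's row labels
accidata = pd.concat([accidata1, accidata2, accidata3], ignore_index = True)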
accidata.info()

### 2. What are the number of casualties in each day of the week? Sort them in descending order. 

In [None]:
#enter code/answer in this cell. You can add more code/markdown cells below for your answer. 
fig, axes = plt.subplots(figsize=(10,5))
axes.set_ylabel('Casualties')

casualties = accidata[["Number_of_Casualties","Day_of_Week"]].groupby(["Day_of_Week"], as_index = False).sum().sort_values(by = 'Number_of_Casualties',ascending = False)
# The UK accident records code Day_of_Week as 1 = Sunday through 7 = Saturday
casualties["Day_of_Week"] = casualties["Day_of_Week"].map({1:"Sunday",2:"Monday",3:"Tuesday",4:"Wednesday",5:"Thursday",6:"Friday",7:"Saturday"})

casualties.plot(x = 'Day_of_Week', y = 'Number_of_Casualties', ax = axes,kind='bar',color= 'red')

Friday and Thursday have the most casualties, while Saturday has noticeably fewer. This suggests that whether the day falls on a weekend affects the number of casualties. 

### 3. On each day of the week, what is the maximum and minimum speed limit on the roads the accidents happened?

In [None]:
speed_limit_data = accidata[["Speed_limit", "Day_of_Week"]].groupby("Day_of_Week")
list_values = ["min", "max", "median", "mean"]
speed_limit_data.agg(list_values)

The minimum and maximum speed limits are almost the same on every day, i.e. 10 and 70 respectively, except for day code 4 in the table above, where the minimum speed limit is 20.
This shows that the speed limit does not depend on the day of the week.

Boxplot can be used to get a better visualisation of the same!

In [None]:
sns.boxplot(x ="Day_of_Week", y = 'Speed_limit', data = accidata)

### 4. What is the importance of Light and Weather conditions in predicting accident severity? What does your intuition say and what does the data portray?

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(data=accidata[["Weather_Conditions","Accident_Severity"]], x= 'Weather_Conditions', hue = 'Accident_Severity')

### In the case of weather, we expect accidents to happen in bad weather such as high winds or snow, but the data shows that the largest number of accidents, including those in the highest severity category, happen when the weather is fine without high winds.

In [None]:
fig, axes = plt.subplots(figsize=(7,5))
axes.set_ylabel('Accident_Severity')
# Mean severity code per weather condition (named severity_weather to keep it apart from the light-condition table)
severity_weather = accidata[['Accident_Severity','Weather_Conditions']].groupby(["Weather_Conditions"], as_index = False).mean().sort_values(by = "Accident_Severity")
print(severity_weather)
severity_weather.plot(x = 'Weather_Conditions', y = 'Accident_Severity', ax = axes, kind='bar',color= 'deepskyblue',ylim = [1,3], yticks = np.arange(0, 3, step=0.2))

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(data=accidata[["Light_Conditions","Accident_Severity"]], x= 'Light_Conditions', hue = 'Accident_Severity')

### According to our intuition, we expect more accidents to happen at night without light, but on the contrary most accidents, including those in the highest severity category, happened during daylight, and the fewest happened in darkness without street lighting.
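
Raw counts are dominated by the most common condition, so it can help to also look at the share of each severity level within a condition. The sketch below does this with a row-normalised crosstab on made-up rows; the category labels and values are assumptions for illustration, not taken from the data.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Light_Conditions": ["Daylight", "Daylight", "Daylight", "Daylight", "Dark", "Dark"],
    "Accident_Severity": [3, 3, 3, 2, 3, 1],
})

# Raw counts per condition and severity (what a countplot shows)
counts = pd.crosstab(toy["Light_Conditions"], toy["Accident_Severity"])
# Shares within each condition: every row sums to 1
shares = pd.crosstab(toy["Light_Conditions"], toy["Accident_Severity"],
                     normalize="index")
print(counts)
print(shares)
```

The same pattern applies to `Weather_Conditions` by swapping the first column.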

In [None]:
fig, axes = plt.subplots(figsize=(7,5))
axes.set_ylabel('Accident_Severity')
severity_light = accidata[['Accident_Severity','Light_Conditions']].groupby(["Light_Conditions"], as_index = False).mean().sort_values(by = "Accident_Severity")
print(severity_light)
severity_light.plot(x = 'Light_Conditions', y = 'Accident_Severity', ax = axes, kind='bar',color= 'deepskyblue',ylim = [1,3], yticks = np.arange(0, 3, step=0.2))

### 5. To predict the severity of the accidents which columns do you think are unnecessary and should be dropped before implementing a regression model. Support your statement using relevant plots and hypotheses derived from them.

In [None]:
accidata.isnull().sum()

We can see that there are a lot of missing values in the dataset.

* Junction_Detail, Junction_Control and LSOA_of_Accident_Location have almost all values missing. Hence, these three can be dropped.
* Accident index just seems to be an index and will not contribute much.

In [None]:
accidata_new = accidata.copy()

In [None]:
accidata_new.drop(columns = ['Junction_Detail','Junction_Control', 'LSOA_of_Accident_Location', 'Accident_Index'], inplace = True)

In [None]:
def split_date_get_month(d):
    if(isinstance(d,str)):
        return(int(d.split('/')[1]))
    else:
        return 1
    
def split_date_get_day(d):
    if(isinstance(d,str)):
        return(int(d.split('/')[0]))
    else:
        return 1

accidata_new['Day'] = accidata_new['Date'].apply(lambda x: split_date_get_day(x))
accidata_new['Month'] = accidata_new['Date'].apply(lambda x: split_date_get_month(x))
accidata_new[['Accident_Severity','Day','Month','Year']].head()

Let's check if these values are actually relevant to Accident_Severity

In [None]:
plt.figure(figsize=(8,5))
sns.heatmap(accidata_new[['Accident_Severity','Day','Month','Year']].corr(), vmin=-1, cmap="coolwarm", annot=True)

In [None]:
accidata_new['Time'].head()
def split_time(t):
    if(isinstance(t, str)):
        return (int(t.split(':')[0]), int(t.split(':')[1]))
    return (0,0)

accidata_new['Hours'], accidata_new['Minutes'] = accidata_new['Time'].apply(lambda x: split_time(x)[0]), accidata_new['Time'].apply(lambda x: split_time(x)[1])
accidata_new[['Accident_Severity', 'Hours', 'Minutes']].head()
sns.heatmap(accidata_new[['Hours', 'Minutes', 'Accident_Severity']].corr(), vmin=-1, cmap="coolwarm", annot=True)

We can see that the date and time don't influence severity. Hence all the date and time columns can be conveniently removed.

In [None]:
accidata_new.drop(columns = ['Day','Month','Year', 'Time', 'Date', 'Hours', 'Minutes'], inplace = True)

In [None]:
numerical_attributes = accidata_new.select_dtypes(include = ['int', 'float'])
plt.figure(figsize=(25,15))
sns.heatmap(numerical_attributes.corr(), vmin=-1, cmap="coolwarm", annot=True)

From this, we can drop columns that are highly correlated with other columns, since they carry redundant information.
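
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. This is a sketch on synthetic columns; the 0.9 threshold and the generated values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Hypothetical toy frame: the second column is nearly a rescaled copy of the first
toy = pd.DataFrame({
    "Longitude": base,
    "Location_Easting_OSGR": base * 2.0 + rng.normal(scale=0.01, size=200),
    "Speed_limit": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag the later column of every pair whose |correlation| exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```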

In [None]:
accidata_new.drop(columns = ['Location_Easting_OSGR','Location_Northing_OSGR','Local_Authority_(District)'], inplace = True)

In [None]:
accidata_new.columns

By intuition, we can drop Police_Force, Local_Authority_(Highway) and Did_Police_Officer_Attend_Scene_of_Accident, as they don't affect the severity of the accident. 

In [None]:
accidata_new.drop(columns = ['Police_Force', 'Local_Authority_(Highway)', 'Did_Police_Officer_Attend_Scene_of_Accident'], inplace = True)

In [None]:
cat = accidata_new.select_dtypes(include = ['object'])
for col in cat:
    print(col)
#     print(accidata_new[col].isnull().sum())
    print(accidata_new[col].value_counts())

In some features a single category is overly dominant, so they carry little information. We can drop those!
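
Dominance can be quantified as the share of rows taken by the most frequent category. A minimal sketch follows; the category labels, the `other_feature` column and the 0.95 cut-off are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame: one feature is dominated by a single category
toy = pd.DataFrame({
    "Carriageway_Hazards": ["No hazard"] * 98 + ["Object on road"] * 2,
    "other_feature": ["a"] * 60 + ["b"] * 40,
})

# Share of rows belonging to the most frequent category of each column
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])
dominant = top_share[top_share > 0.95].index.tolist()
print(top_share)
print(dominant)
```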

In [None]:
accidata_new.drop(columns = ['Special_Conditions_at_Site', 'Pedestrian_Crossing-Human_Control', 'Carriageway_Hazards'], inplace = True)

In [None]:
accidata_new.columns

Fill all the null values with the most frequent category.

In [None]:
accidata_new = accidata_new.apply(lambda x:x.fillna(x.value_counts().index[0]))

Encode all the categorical variables using mean encoding, i.e. replace each category with the mean Accident_Severity of the rows in that category.
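
To show what mean encoding produces, here is a sketch on a few made-up rows; the column name `surface` and its values are hypothetical. Because the encoding is computed from the target itself, it can leak target information into the features, which is worth keeping in mind when reading the model score.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "surface": ["Dry", "Dry", "Wet", "Wet", "Wet"],
    "Accident_Severity": [3, 2, 3, 3, 3],
})

# Mean target value per category
means = toy.groupby("surface")["Accident_Severity"].mean()
# Replace every category by its mean target value
toy["surface_encoded"] = toy["surface"].map(means)
print(toy)
```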

In [None]:
cat = accidata_new.select_dtypes(include = ['object'])
for col in cat:
    mean_encode = accidata_new.groupby(col)['Accident_Severity'].mean()
    accidata_new.loc[:,col] = accidata_new[col].map(mean_encode)

### 6. Implement a basic Logistic Regression Model using scikit learn with cross validation = 5, where you predict the severity of the accident (Accident_Severity). Note that here your goal is not to tune appropriate hyperparameters, but to figure out what features will be best to use.

In [None]:
import sklearn.linear_model as linear_model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegressionCV

In [None]:
y = accidata_new['Accident_Severity']
x = accidata_new.drop(columns=['Accident_Severity'],inplace= False)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x = scaler.fit_transform(x)

In [None]:
cf = LogisticRegressionCV(cv=5, multi_class="multinomial", max_iter=1000,verbose=100).fit(x, y)
print(cf.score(x,y))

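We can see that the model reaches an accuracy of about 85%. Note that this score is computed on the same data the model was fitted on.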
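
For an estimate on held-out data, the accuracy of each cross-validation fold can be inspected and compared with the share of the most frequent class. The sketch below uses synthetic data from `make_classification` in place of the accident features, so its numbers say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the accident data
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Scaling inside the pipeline is re-fitted on the training part of every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)  # accuracy on each held-out fold

# Accuracy of always predicting the most frequent class, as a baseline
baseline = np.bincount(y).max() / len(y)
print(scores.mean(), baseline)
```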