<h3><b>This notebook presents a univariate analysis for all variables in the US Accidents dataset</b></h3>
<br>For the full data dictionary, visit the official web page: https://smoosavi.org/datasets/us_accidents

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# reading the data into a dataframe
accident_df = pd.read_csv('../input/us-accidents/US_Accidents_May19.csv')

In [None]:
# finding out number of rows and columns
shape = accident_df.shape
print( "Number of rows" , "{:,}".format(shape[0]) )
print( "Number of variables" , shape[1] )

In [None]:
# print a sample to get sense of data
accident_df.head(5)

Some variables are numerical and some are categorical. Let's put them into two separate groups using ***_get_numeric_data()*** (a private pandas helper).

In [None]:
cols = accident_df.columns
num_cols = accident_df._get_numeric_data().columns

The numerical columns are:

In [None]:
', '.join( str(x) for x in list(num_cols) )

> Note that this only lists the columns stored as numbers; it does NOT say whether a column is continuous or not.
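
One rough way to separate code-like numeric columns from measured ones is to count distinct values. The sketch below uses a small made-up frame rather than the accident data; the helper name `split_numeric_columns` and the `max_levels` cut-off are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

def split_numeric_columns(df, max_levels=10):
    """Split numeric columns into likely-coded and likely-continuous ones.

    max_levels is an arbitrary cut-off used for illustration only.
    """
    numeric = df.select_dtypes(include="number")
    distinct = numeric.nunique()
    coded = distinct[distinct <= max_levels].index.tolist()
    continuous = distinct[distinct > max_levels].index.tolist()
    return coded, continuous

# made-up data, not the accident dataset
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4, 2, 3] * 5,
    "Temperature(F)": [float(t) for t in range(30, 60)],
    "City": ["A", "B", "C"] * 10,
})
coded, continuous = split_numeric_columns(toy, max_levels=10)
print("likely coded:", coded)
print("likely continuous:", continuous)
```

A low distinct count is only a hint: a column such as Severity is numeric in storage but behaves like a label, so the result still needs a manual check against the data dictionary.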

In [None]:
str_col = list( set(cols) - set(num_cols) )
', '.join( str(x) for x in str_col ) 

> Some variables, such as Zipcode, are stored as strings rather than numbers. We will address this later in the analysis.

<h1>Categorical Variables</h1>

Categorical variables take on values that are names or labels [1] 

The following variables are categorical: <i>Source, TMC, Severity, Number+Street, Side, City, County, State, Zipcode, Country, Timezone, Airport_Code, Wind_Direction, Weather_Condition, Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight</i>

A street number by itself is meaningless, so I will combine it with the street name in a new column called 'street_name_num'.

In [None]:
# Number is stored as float: fill missing values with 0, cast to int to drop the decimals,
# cast to string, then blank out the 0 placeholders before joining with the street name
accident_df['Number'] = accident_df['Number'].replace(np.nan, 0)
accident_df['Number'] = accident_df['Number'].astype(int)
accident_df['Number'] = accident_df['Number'].astype(str)
accident_df['Number'] = accident_df['Number'].replace('0', '')
accident_df['street_name_num'] = accident_df['Number'] + ' ' + accident_df['Street']

In [None]:
accident_df.sample(5)[ ['Number' , 'Street' , 'street_name_num'] ] 

> street_name_num column combines both street number with street name
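
The same result can probably be reached without the temporary 0 placeholder by using pandas' nullable `Int64` and `string` dtypes. This is a sketch on made-up rows, not the author's method; unlike the cell above, it also strips the leading space left when the street number is missing.

```python
import numpy as np
import pandas as pd

# made-up rows; 'Number' is float because missing values are NaN
toy = pd.DataFrame({
    "Number": [120.0, np.nan, 45.0],
    "Street": ["Main St", "I-95 N", "Oak Ave"],
})

# nullable Int64 drops the decimal part while keeping missing values missing
number_text = toy["Number"].astype("Int64").astype("string").fillna("")

# join number and street, then trim the space left by a missing number
toy["street_name_num"] = (number_text + " " + toy["Street"].astype("string")).str.strip()
print(toy["street_name_num"].tolist())
```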

<h3>Missing Values</h3>

The data is most likely complete, but it is always good practice to check for missing values.

In [None]:
percent_missing = accident_df.isnull().sum() * 100 / len(accident_df)
missing_value_df = pd.DataFrame({'column_name': accident_df.columns,
                                 'percent_missing': percent_missing})
# keep only columns that have missing values, then sort from most to least missing
missing_value_df[missing_value_df['percent_missing'] != 0].sort_values('percent_missing', ascending=False)


1. > <i>Precipitation, Wind_Chill, End_Lat, and End_Lng</i> have the highest share of missing values. They can be dropped unless further analysis shows that keeping them adds value for a future model.
2. > Other variables such as Wind_Speed, Weather_Condition, Visibility, Humidity, Temperature, Pressure, Wind_Direction, and Weather_Timestamp can be imputed.
3. > In general, the data has no missing values for important variables such as severity, location, time, weather, and driving conditions.
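
The imputation mentioned in point 2 could look roughly like the sketch below: median for numeric columns and most frequent value for the rest. The helper `impute_simple` and the toy frame are hypothetical, and the sketch assumes every column has at least one observed value; other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

def impute_simple(df):
    """Fill numeric gaps with the median and other gaps with the most frequent value."""
    filled = df.copy()
    for column in filled.columns:
        if filled[column].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled

# made-up rows, not the accident dataset
toy = pd.DataFrame({
    "Humidity(%)": [50.0, np.nan, 70.0, 90.0],
    "Wind_Direction": ["N", "N", None, "SW"],
})
filled = impute_simple(toy)
print(filled)
```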


<h2>Most frequent analysis</h2>
<p>For each categorical variable, I will display the most frequent values. The following subsections group the variables according to their meaning.</p>

In [None]:
def create_plots(columns_list, ncols=3):
    # one count plot per column, laid out on a grid with `ncols` columns
    nrows = (len(columns_list) - 1) // ncols + 1

    sns.set(style="darkgrid")

    # squeeze=False keeps axs two-dimensional even when there is a single row
    fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, nrows * 5), squeeze=False)
    plt.subplots_adjust(hspace=0.7, wspace=0.5)

    for index, column in enumerate(columns_list):
        # keep only the 10 most frequent values of the column
        order = accident_df[column].value_counts().iloc[:10].index

        # pass the data as x=...; recent seaborn versions reject it as a positional argument
        g = sns.countplot(x=accident_df[column], alpha=0.9,
                          order=order,
                          ax=axs[index // ncols][index % ncols])

        g.set_xticklabels(labels=order, rotation=60)
        g.set_title(column)

<h3>Location</h3>
<p>This part will look at the distribution of the following variables: <i>'City', 'County', 'State', 'Zipcode', 'Country', 'street_name_num'</i></p>

In [None]:
columns_list = ['City', 'County', 'State' , 'Zipcode' , 'Country' , 'street_name_num' ]
create_plots(columns_list, ncols=3)

> The six figures above show that:
* California has the most accidents among all states, followed by Texas and Florida.
* Los Angeles County in California has the most accidents among counties, followed by Harris County in Texas.
* Interestingly, Charlotte, NC is the city with the second-most accidents, even though North Carolina ranks only fourth among states.
* The zip code figure largely mirrors the county figure, with Los Angeles having the highest count.
* The Country variable is not useful and may be dropped later, since it has only one value (US).
* The top 10 streets show that most accidents occur on highways rather than local streets, which makes sense.
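
Rankings like these are easier to quote when the counts behind the bars are also tabulated. The sketch below shows one way to do that; `top_values` is a hypothetical helper and the state codes are made-up rows, not figures from the dataset.

```python
import pandas as pd

def top_values(series, n=10):
    """Return the n most frequent values with their counts and share of all rows."""
    counts = series.value_counts().head(n)
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / len(series) * 100).round(1),
    })

# made-up values for illustration only
toy_states = pd.Series(["CA", "CA", "TX", "FL", "CA", "TX", "NC", "CA"], name="State")
table = top_values(toy_states, n=3)
print(table)
```

On the real data the call would be something like `top_values(accident_df['State'])`, assuming `accident_df` is loaded as in the cells above.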

<h3>Time</h3>

<p>This part will look at the distribution of the following variables: <i>'Timezone', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight'</i></p>

In [None]:
columns_list = ['Timezone' , 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight']
create_plots(columns_list,ncols=3)

The five figures above show that most accidents happen during the daytime. The four day/night variables carry much the same information, so they can likely be reduced to one variable. US/Eastern has the highest count, suggesting that states in the Eastern time zone have more accidents in total than states in other time zones.
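
Before collapsing the day/night variables into one, it may be worth measuring how often they agree row by row. The sketch below does this on made-up labels; the choice of `Sunrise_Sunset` as the reference column is an assumption for illustration.

```python
import pandas as pd

# made-up Day/Night labels, not rows from the accident dataset
toy = pd.DataFrame({
    "Sunrise_Sunset":        ["Day", "Day", "Night", "Night", "Day", "Night"],
    "Civil_Twilight":        ["Day", "Day", "Day",   "Night", "Day", "Night"],
    "Astronomical_Twilight": ["Day", "Day", "Day",   "Day",   "Day", "Night"],
})

reference = "Sunrise_Sunset"

# share of rows where each column carries the same label as the reference column
agreement = {
    column: (toy[column] == toy[reference]).mean()
    for column in toy.columns if column != reference
}
print(agreement)

# a cross-tabulation shows where the two labelings disagree
print(pd.crosstab(toy["Sunrise_Sunset"], toy["Civil_Twilight"]))
```

A high agreement rate would support keeping a single column; a lower one would suggest the twilight definitions add information around dawn and dusk.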

<h3>Driving Conditions</h3>
This part will look at the following variables: <i>'Airport_Code', 'Wind_Direction', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Side', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop' </i>

In [None]:
columns_list = ['Airport_Code', 'Wind_Direction', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Side', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 
                'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop'] 
create_plots(columns_list,ncols=3)

> The previous figures show the presence of specific road conditions.
* Most recorded accidents have False (not present) values for these conditions.
* Turning_Loop can be dropped since it always has the value False.
<p>Since this notebook covers univariate analysis only, nothing can be said about how these variables relate to severity level or to each other. I will consider these relations in the next round of analysis.</p>
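
Columns that never vary, such as Turning_Loop here or Country earlier, can be detected programmatically instead of by reading the plots. This is a minimal sketch on a made-up frame; `constant_columns` is a hypothetical helper name.

```python
import pandas as pd

def constant_columns(df):
    """List columns that hold at most one distinct value (missing values ignored)."""
    return [column for column in df.columns if df[column].nunique(dropna=True) <= 1]

# made-up rows for illustration only
toy = pd.DataFrame({
    "Turning_Loop": [False, False, False, False],
    "Country": ["US", "US", "US", "US"],
    "Traffic_Signal": [False, True, False, False],
})
to_drop = constant_columns(toy)
reduced = toy.drop(columns=to_drop)
print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```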



<h3>Accidents</h3>
Accident severity and source are the last two categorical variables visited in this notebook.

In [None]:
columns_list = ['Severity' , 'Source']
create_plots(columns_list, ncols=2)

> This shows that most accidents have severity level of 2, and MapQuest is the main accident source

<h1>Numerical Variables</h1>
<p>A numerical variable is a variable where the measurement or number has a numerical meaning. [2]
The following variables are numerical: <i>Distance(mi), Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Speed(mph), Precipitation(in)</i></p>

In [None]:
sns.set(color_codes=True)

numeric_columns = ['Distance(mi)', 'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)',
                   'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)']

# print summary statistics for each numerical column (pandas skips NaN values)
for column_name in numeric_columns:
    values = accident_df[column_name]

    print(column_name,
          ',min =', values.min(), ',max =', values.max(),
          ',avg =', values.mean(), ',std =', values.std(),
          ',skewness =', values.skew(), ',kurtosis =', values.kurt())
    print()

> 'Temperature(F)', 'Wind_Chill(F)', and 'Humidity(%)' should be roughly normally distributed, while the other variables have high skewness and kurtosis. That means their distributions are asymmetric and heavy-tailed: most values sit in a narrow range, with a few extreme values far from it. See the example distribution plots below for 'Precipitation(in)' and 'Temperature(F)'.
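
To make the skewness and kurtosis figures easier to read, the sketch below computes them for two synthetic samples: one symmetric and one right-skewed. The samples and their parameters are made up for illustration and do not model the accident data. Note that pandas' `kurt()` reports excess kurtosis, so a normal distribution scores close to 0.

```python
import numpy as np
import pandas as pd

# synthetic samples, seeded for reproducibility
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(loc=60, scale=15, size=100_000))
right_skewed = pd.Series(rng.exponential(scale=0.1, size=100_000))

for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name,
          "skewness =", round(values.skew(), 2),
          "kurtosis =", round(values.kurt(), 2))
```

The symmetric sample should score near zero on both measures, while the right-skewed one should score clearly positive on both, which is the pattern to look for in variables such as precipitation.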

In [None]:
fig, axs = plt.subplots(nrows=1 , ncols=2, figsize= (10,5))

# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(accident_df['Precipitation(in)'].dropna(), bins=50, stat='density', kde=True, ax=axs[0])
sns.histplot(accident_df['Temperature(F)'].dropna(), bins=50, stat='density', kde=True, ax=axs[1])

<h3>Wrap Up</h3>
<p>This notebook presents a univariate analysis of the US Accidents dataset. Univariate analysis alone rarely yields striking takeaways, but it is the natural starting point for initial data analysis. The next notebook will look into multivariate analysis and correlation.</p>
<p>For any suggestions, edits, or questions, leave a comment below.</p>


<h3>References</h3>
<li>[1] https://stattrek.com/statistics/dictionary.aspx?definition=categorical%20variable</li>
<li>[2] https://socratic.org/questions/what-is-a-numerical-variable-and-what-is-a-categorical-variable</li>