<a href="https://colab.research.google.com/github/zusoomro/545FinalProject/blob/master/545_Cleaning_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cleaning and Preprocessing the Accident Dataset

## Setting up

### Loading in the necessary modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib

import gc
import time
import warnings

import nltk
nltk.download('punkt')

from wordcloud import WordCloud
import matplotlib.pyplot as plt

from collections import Counter

import scipy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



## Loading in the data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

drive_data_path = "/content/drive/My Drive/CIS 545/Final Project/US_Accidents_Dec19.csv"

with open(drive_data_path, 'r') as f:
  data = pd.read_csv(f)

Mounted at /content/drive


In [3]:
data.shape

(2974335, 49)

## Part One: Easier said than done

This part of the data processing is going to be focused on cleaning the data and retrieving initial features.

### Cleaning out irrelevant and invalid data

First, we're going to remove columns that aren't helpful to the task at hand: predicting severity. For example, we aren't going to be doing any sort of prediction using the location of the accident, so columns involving location can be dropped. 

In [0]:
processed_data = data.drop(['Start_Lat', 'End_Lat', 'Start_Lng', 'End_Lng', 
                            'Number', 'Street', 'City', 'County', 'State', 
                            'Zipcode', 'Country', 'Timezone', 'Airport_Code'], 
                           axis=1)

In [5]:
processed_data.shape

(2974335, 36)

I'm also going to drop many of the weather columns. From the EDA, most of them have very low correlation with severity, and dropping them should reduce the number of rows that have to be dropped because of null values. I'm going to keep Weather Condition, Visibility, Precipitation, and Temperature as general indicators of the weather at the accident. Later in the analysis, if these turn out to be extremely important factors in predicting severity, I'll revisit this decision and perhaps keep more of the weather columns, at the expense of dropping more rows.

In [0]:
processed_data = processed_data.drop(['Weather_Timestamp', 'Wind_Chill(F)', 
                                      'Humidity(%)', 'Pressure(in)', 
                                      'Wind_Direction', 'Wind_Speed(mph)'], axis=1)

Additionally, the ID column is not useful to us since pandas maintains its own row index.

In [0]:
processed_data = processed_data.drop(['ID'], axis=1)

Next, we're going to take a look at the number of rows and columns containing NaN's and consider dropping them.

In [8]:
processed_data.isnull().any(axis=1).sum()  

2271132

In [9]:
processed_data.isnull().sum(axis=0)

Source                         0
TMC                       728071
Severity                       0
Start_Time                     0
End_Time                       0
Distance(mi)                   0
Description                    1
Side                           0
Temperature(F)             56063
Visibility(mi)             65691
Precipitation(in)        1998358
Weather_Condition          65932
Amenity                        0
Bump                           0
Crossing                       0
Give_Way                       0
Junction                       0
No_Exit                        0
Railway                        0
Roundabout                     0
Station                        0
Stop                           0
Traffic_Calming                0
Traffic_Signal                 0
Turning_Loop                   0
Sunrise_Sunset                93
Civil_Twilight                93
Nautical_Twilight             93
Astronomical_Twilight         93
dtype: int64

It looks like Precipitation has over 1.9 million null values, so I'm going to drop it in an effort to reduce the total number of rows that we need to drop. Apart from TMC, which I'll come back to below, the remaining columns each have fewer than 100k nulls, so for those I'm going to drop the individual rows instead of the columns.
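
As a rough illustration of the column-versus-row trade-off described above, the sketch below computes the share of nulls per column on a small made-up frame, drops any column above a cutoff, and only then drops the remaining incomplete rows. The 50% cutoff and the toy values are assumptions for the example, not figures taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data (values are made up).
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4, 2, 3],
    "Temperature(F)": [36.9, np.nan, 36.0, 35.1, 36.0, 37.9],
    "Precipitation(in)": [np.nan, np.nan, np.nan, 0.02, np.nan, np.nan],
})

# Fraction of missing values in each column.
null_share = toy.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
cutoff = 0.5
mostly_empty = null_share[null_share > cutoff].index.tolist()

# Drop the mostly-empty columns first, then the rows that still have nulls.
cleaned = toy.drop(columns=mostly_empty).dropna()

# For comparison: dropping rows without removing any column first.
rows_only = toy.dropna()

print(mostly_empty)
print(cleaned.shape, rows_only.shape)
```

On this toy frame, removing the mostly-empty column first keeps five of the six rows, whereas dropping rows straight away keeps only one.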

In [0]:
processed_data = processed_data.drop(['Precipitation(in)'], axis=1)

In [11]:
processed_data.shape

(2974335, 28)

And taking a look at the nulls on the rows again...

In [12]:
processed_data.isnull().any(axis=1).sum()  

785655

In [13]:
processed_data.isnull().sum(axis=0)

Source                        0
TMC                      728071
Severity                      0
Start_Time                    0
End_Time                      0
Distance(mi)                  0
Description                   1
Side                          0
Temperature(F)            56063
Visibility(mi)            65691
Weather_Condition         65932
Amenity                       0
Bump                          0
Crossing                      0
Give_Way                      0
Junction                      0
No_Exit                       0
Railway                       0
Roundabout                    0
Station                       0
Stop                          0
Traffic_Calming               0
Traffic_Signal                0
Turning_Loop                  0
Sunrise_Sunset               93
Civil_Twilight               93
Nautical_Twilight            93
Astronomical_Twilight        93
dtype: int64

TMC is an important part of the analysis for this problem, so although there are ~700k null values there, I'm going to move forward and drop rows with null values. If this causes a problem in further analysis, I'll revisit.

In [0]:
processed_data = processed_data.dropna()

In [15]:
processed_data.shape

(2188680, 28)

### Processing the dates into an accident duration

The data currently stores the start and end time of each accident as timestamps. These won't be helpful for modeling on their own, so I'm going to process them into a "Duration(s)" column, which represents end time minus start time in seconds.

In [16]:
processed_data[['Start_Time', 'End_Time']].dtypes

Start_Time    object
End_Time      object
dtype: object

Looks like they're stored as objects (strings), so we're going to have to convert them into datetime objects first.

In [17]:
processed_data[['Start_Time', 'End_Time']].head()

Unnamed: 0,Start_Time,End_Time
0,2016-02-08 05:46:00,2016-02-08 11:00:00
1,2016-02-08 06:07:59,2016-02-08 06:37:59
2,2016-02-08 06:49:27,2016-02-08 07:19:27
3,2016-02-08 07:23:34,2016-02-08 07:53:34
4,2016-02-08 07:39:07,2016-02-08 08:09:07


In [0]:
processed_data['Start_Time'] = pd.to_datetime(processed_data['Start_Time'])
processed_data['End_Time'] = pd.to_datetime(processed_data['End_Time'])

In [19]:
processed_data[['Start_Time', 'End_Time']].head()

Unnamed: 0,Start_Time,End_Time
0,2016-02-08 05:46:00,2016-02-08 11:00:00
1,2016-02-08 06:07:59,2016-02-08 06:37:59
2,2016-02-08 06:49:27,2016-02-08 07:19:27
3,2016-02-08 07:23:34,2016-02-08 07:53:34
4,2016-02-08 07:39:07,2016-02-08 08:09:07


In [20]:
processed_data[['Start_Time', 'End_Time']].dtypes

Start_Time    datetime64[ns]
End_Time      datetime64[ns]
dtype: object

Now, to convert them to a duration.

In [0]:
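# End minus start keeps durations positive; total_seconds() avoids the wrap-around
# of .dt.seconds on negative timedeltas (e.g. -30 min reads as 84600, as recorded below).
processed_data['Duration(s)'] = processed_data['End_Time'] - processed_data['Start_Time']
processed_data['Duration(s)'] = processed_data['Duration(s)'].dt.total_seconds().astype(int)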

In [22]:
processed_data['Duration(s)']

0          67560
1          84600
2          84600
3          84600
4          84600
           ...  
2246259    84623
2246260    84665
2246261    84701
2246262    84661
2246263    84600
Name: Duration(s), Length: 2188680, dtype: int64

In [23]:
processed_data['Duration(s)'].describe()

count    2.188680e+06
mean     8.352427e+04
std      1.993337e+03
min      0.000000e+00
25%      8.283300e+04
50%      8.385100e+04
75%      8.461900e+04
max      8.632700e+04
Name: Duration(s), dtype: float64
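
The values above cluster just under 86,400, the number of seconds in a day. A short sketch using the first two timestamp pairs shown earlier suggests why: subtracting the end time from the start time yields a negative timedelta, which pandas stores as "-1 day plus a remainder", and `.dt.seconds` reports only that remainder. The variable names here are illustrative.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["2016-02-08 05:46:00", "2016-02-08 06:07:59"]))
end = pd.to_datetime(pd.Series(["2016-02-08 11:00:00", "2016-02-08 06:37:59"]))

# Start minus end is negative; -30 minutes is stored as "-1 day + 84600 s",
# so .dt.seconds returns the wrapped-around remainder.
wrapped = (start - end).dt.seconds

# End minus start with total_seconds() gives the elapsed time directly.
elapsed = (end - start).dt.total_seconds().astype(int)

print(wrapped.tolist())  # [67560, 84600]
print(elapsed.tolist())  # [18840, 1800]
```

Under this reading, the recorded 84600 corresponds to a 30-minute accident (86400 - 1800) and 67560 to one lasting 5 hours 14 minutes.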

In [0]:
processed_data = processed_data.drop(['Start_Time', 'End_Time'], axis=1)

In [25]:
processed_data.dtypes

Source                    object
TMC                      float64
Severity                   int64
Distance(mi)             float64
Description               object
Side                      object
Temperature(F)           float64
Visibility(mi)           float64
Weather_Condition         object
Amenity                     bool
Bump                        bool
Crossing                    bool
Give_Way                    bool
Junction                    bool
No_Exit                     bool
Railway                     bool
Roundabout                  bool
Station                     bool
Stop                        bool
Traffic_Calming             bool
Traffic_Signal              bool
Turning_Loop                bool
Sunrise_Sunset            object
Civil_Twilight            object
Nautical_Twilight         object
Astronomical_Twilight     object
Duration(s)                int64
dtype: object

I'm going to want to convert the Source, side, and weather conditions to a one-hot encoding.

### One-hot encodings

There's a lot of categorical information in this data which might be helpful for modeling severity. I'm going to convert them to one-hot encodings.

In [26]:
processed_data = \
  pd.concat([processed_data.drop(['Source'], axis=1), 
             pd.get_dummies(processed_data['Source'], 
             sparse=True)], axis=1);
processed_data.shape

(2188680, 28)

In [27]:
processed_data = \
  pd.concat([processed_data.drop(['Side'], axis=1), 
             pd.get_dummies(processed_data['Side'],
             sparse=True)], axis=1);
processed_data.shape

(2188680, 30)

In [28]:
processed_data = \
  pd.concat([processed_data.drop(['Weather_Condition'], axis=1),
             pd.get_dummies(processed_data['Weather_Condition'], 
             sparse=True)], axis=1);
processed_data.shape

(2188680, 145)

That's a lot of columns. If this blows up the feature space too much, and operations end up taking too long, I'll consider dropping this information.
EDIT: I'm using pandas sparse columns, and eventually scipy sparse matrices, to reduce the feature size and computation time.
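
To give a feel for what `sparse=True` buys, here is a minimal sketch comparing the memory footprint of dense and sparse one-hot columns. The random 50-label column and the 10,000-row size are assumptions chosen for illustration; the real savings depend on the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical categorical column with up to 50 distinct labels.
labels = pd.Series(rng.integers(0, 50, size=10_000).astype(str),
                   name="Weather_Condition")

# Same encoding, stored densely and as pandas sparse columns.
dense = pd.get_dummies(labels)
sparse = pd.get_dummies(labels, sparse=True)

# Bytes used by the column data only (index excluded).
dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()

print(dense_bytes, sparse_bytes)
```

Because each row has exactly one non-zero entry, the sparse version only stores one value and one position per row rather than one cell per row per category.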

In [29]:
processed_data = \
  pd.concat([processed_data.drop(['Sunrise_Sunset'], axis=1),
             pd.get_dummies(processed_data['Sunrise_Sunset'], 
             sparse=True)], axis=1);
processed_data.shape

(2188680, 146)

In [30]:
processed_data = \
  pd.concat([processed_data.drop(['TMC'], axis=1),
             pd.get_dummies(processed_data['TMC'], 
             sparse=True)], axis=1);
processed_data.shape

(2188680, 166)

And now dropping the other Sunrise/Sunset columns since they convey nearly identical information.

In [0]:
processed_data = processed_data.drop(['Civil_Twilight',
                                      'Nautical_Twilight',
                                      'Astronomical_Twilight',
                                      ], axis=1)

In [32]:
processed_data.shape

(2188680, 163)

### Bools to ints

Converting the location booleans to integer fields.

In [33]:
processed_data.dtypes[0:35]

Severity                           int64
Distance(mi)                     float64
Description                       object
Temperature(F)                   float64
Visibility(mi)                   float64
Amenity                             bool
Bump                                bool
Crossing                            bool
Give_Way                            bool
Junction                            bool
No_Exit                             bool
Railway                             bool
Roundabout                          bool
Station                             bool
Stop                                bool
Traffic_Calming                     bool
Traffic_Signal                      bool
Turning_Loop                        bool
Duration(s)                        int64
MapQuest                Sparse[uint8, 0]
MapQuest-Bing           Sparse[uint8, 0]
                        Sparse[uint8, 0]
L                       Sparse[uint8, 0]
R                       Sparse[uint8, 0]
Blowing Dust    

In [0]:
# Cast every road-feature flag from bool to int in one pass.
bool_cols = ["Amenity", "Bump", "Crossing", "Give_Way", "Junction", "No_Exit",
             "Railway", "Roundabout", "Station", "Stop", "Traffic_Calming",
             "Traffic_Signal", "Turning_Loop"]
processed_data[bool_cols] = processed_data[bool_cols].astype(int)

In [35]:
processed_data.dtypes[0:35]

Severity                           int64
Distance(mi)                     float64
Description                       object
Temperature(F)                   float64
Visibility(mi)                   float64
Amenity                            int64
Bump                               int64
Crossing                           int64
Give_Way                           int64
Junction                           int64
No_Exit                            int64
Railway                            int64
Roundabout                         int64
Station                            int64
Stop                               int64
Traffic_Calming                    int64
Traffic_Signal                     int64
Turning_Loop                       int64
Duration(s)                        int64
MapQuest                Sparse[uint8, 0]
MapQuest-Bing           Sparse[uint8, 0]
                        Sparse[uint8, 0]
L                       Sparse[uint8, 0]
R                       Sparse[uint8, 0]
Blowing Dust    

Description is the only field here that is still an object. I'm going to convert it to TF-IDF weights in a later part.

In [36]:
processed_data.select_dtypes(include=['object']).head()

Unnamed: 0,Description
0,Right lane blocked due to accident on I-70 Eas...
1,Accident on Brice Rd at Tussing Rd. Expect del...
2,Accident on OH-32 State Route 32 Westbound at ...
3,Accident on I-75 Southbound at Exits 52 52B US...
4,Accident on McEwen Rd at OH-725 Miamisburg Cen...


In [37]:
processed_data.memory_usage().sum()

404905800

Under half a gig. Not bad. Gonna save the data here for future processing.





In [0]:
# to_pickle writes the file and returns None, so there is nothing to assign.
processed_data.to_pickle('/content/drive/My Drive/CIS 545/Final Project/processed_data_1')

## Part Two: Text processing

I'm going to process the descriptions first by standardizing and cleaning the text, then by converting the text into TF-IDF weights to get a better sense of the importance of words and their relative rarity.

Here are some preprocessing steps I'm going to take:
- Clean out punctuation
- Take out stop words
- Convert all abbreviations
- Lowercase the entire text
- Stem the words (reduce each word to its root form)
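
The first few steps in the list above can be sketched in plain Python as follows. This is only an approximation of the notebook's approach: the tiny stop-word set and the regex tokenizer are assumptions standing in for NLTK's English stop-word list and `word_tokenize`, and abbreviation handling and stemming are left out.

```python
import re

# Small illustrative stop-word list (an assumption, not NLTK's full list).
# "i" is deliberately left out so interstate names such as "I-70" survive.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}

def clean_description(text):
    """Lowercase, strip punctuation, and drop stop words from one description."""
    text = text.lower().replace("-", " ")      # "I-70" -> "i 70"
    tokens = re.findall(r"[a-z0-9]+", text)    # keep alphanumeric runs only
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_description("Right lane blocked due to accident on I-70 Eastbound."))
# right lane blocked due accident i 70 eastbound
```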

In [39]:
# Read in the previously stored pickle

from google.colab import drive
drive.mount('/content/drive')

processed_data_1_path = \
  '/content/drive/My Drive/CIS 545/Final Project/processed_data_1'

with open(processed_data_1_path, 'rb') as f:
  processed_data_2 = pd.read_pickle(f)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Convert to lowercase

processed_data_2['Description'] = (processed_data_2['Description']
                                   .apply(lambda x: x.lower()))

In [41]:
processed_data_2['Description']

0          right lane blocked due to accident on i-70 eas...
1          accident on brice rd at tussing rd. expect del...
2          accident on oh-32 state route 32 westbound at ...
3          accident on i-75 southbound at exits 52 52b us...
4          accident on mcewen rd at oh-725 miamisburg cen...
                                 ...                        
2246259    accident on i-5 southbound at forest rte-7n09 ...
2246260    left lane closed due to accident on i-10 at na...
2246261        accident on olive ave at ca-66 foothill blvd.
2246262    #1 lane blocked due to accident on i-605 north...
2246263    accident on temescal canyon rd eastbound in la...
Name: Description, Length: 2188680, dtype: object

In [42]:
# Remove punctuation

def remove_punctuation(string):
  # Dashes in Interstate names were causing them to be removed, so I'm replacing
  # them with spaces.
  string = string.replace('-', ' ')
  words = nltk.word_tokenize(string)
  new_words = [word for word in words if word.isalnum()]
  return " ".join(new_words)

processed_data_2['Description'] = (processed_data_2['Description']
                                   .apply(lambda x: remove_punctuation(x)))

processed_data_2['Description'].head()

0    right lane blocked due to accident on i 70 eas...
1     accident on brice rd at tussing rd expect delays
2    accident on oh 32 state route 32 westbound at ...
3    accident on i 75 southbound at exits 52 52b us...
4    accident on mcewen rd at oh 725 miamisburg cen...
Name: Description, dtype: object

In [43]:
# Take out stop words

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

def remove_stop_words(string):
  # We want to maintain interstate information, so I'm keeping the I
  words = nltk.word_tokenize(string)
  # Compare by value with ==; `is` tests object identity and is unreliable for strings.
  new_words = [word for word in words if (word not in stop or word == "i")]
  return " ".join(new_words)

processed_data_2['Description'] = (processed_data_2['Description']
                                   .apply(lambda x: remove_stop_words(x)))

processed_data_2['Description'].head(20)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


0     right lane blocked due accident i 70 eastbound...
1            accident brice rd tussing rd expect delays
2     accident oh 32 state route 32 westbound dela p...
3     accident i 75 southbound exits 52 52b us expec...
4     accident mcewen rd oh 725 miamisburg centervil...
5     accident i 270 outerbelt northbound near exit ...
6           accident oakridge dr woodward expect delays
7     accident i 75 southbound exit 54b grand expect...
8          accident notre dame ave warner expect delays
9     right hand shoulder blocked due accident i 270...
10    accident i 270 outerbelt northbound exits 7 7a...
11    one lane blocked due accident i 70 westbound e...
12         accident revere ave watervliet expect delays
13    accident salem ave hillcrest ave kensington ex...
14       accident oh 16 broad st james rd expect delays
15             accident wayne ave glencoe expect delays
16         accident james h mcgee blvd us expect delays
17          accident delphos ave brooklyn expect

In [44]:
# Stem every word

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stem(string):
  words = nltk.word_tokenize(string)
  new_words = [stemmer.stem(word) for word in words]
  return " ".join(new_words)

processed_data_2['Description'] = (processed_data_2['Description']
                                   .apply(lambda x: stem(x)))

processed_data_2['Description'].head()

0    right lane block due accid i 70 eastbound exit...
1                  accid brice rd tuss rd expect delay
2    accid oh 32 state rout 32 westbound dela palma...
3    accid i 75 southbound exit 52 52b us expect delay
4    accid mcewen rd oh 725 miamisburg centervill r...
Name: Description, dtype: object

In [45]:
processed_data_2['Description'].head(20)

0     right lane block due accid i 70 eastbound exit...
1                   accid brice rd tuss rd expect delay
2     accid oh 32 state rout 32 westbound dela palma...
3     accid i 75 southbound exit 52 52b us expect delay
4     accid mcewen rd oh 725 miamisburg centervill r...
5     accid i 270 outerbelt northbound near exit 29 ...
6                accid oakridg dr woodward expect delay
7     accid i 75 southbound exit 54b grand expect delay
8               accid notr dame ave warner expect delay
9     right hand shoulder block due accid i 270 oute...
10    accid i 270 outerbelt northbound exit 7 7a 7b ...
11    one lane block due accid i 70 westbound exit 1...
12              accid rever ave watervliet expect delay
13    accid salem ave hillcrest ave kensington expec...
14            accid oh 16 broad st jame rd expect delay
15                   accid wayn ave glenco expect delay
16              accid jame h mcgee blvd us expect delay
17               accid delpho ave brooklyn expec

This is a good point to stop, so I'm going to save the data again.

In [0]:
# to_pickle writes the file and returns None, so there is nothing to assign.
processed_data_2.to_pickle('/content/drive/My Drive/CIS 545/Final Project/processed_data_2')

## Part Three: TF-IDF

In [47]:
# Read in the previously stored pickle

from google.colab import drive
drive.mount('/content/drive')

processed_data_2_path = \
  '/content/drive/My Drive/CIS 545/Final Project/processed_data_2'

with open(processed_data_2_path, 'rb') as f:
  processed_data_3 = pd.read_pickle(f)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
processed_data_3.head(10)

Unnamed: 0,Severity,Distance(mi),Description,Temperature(F),Visibility(mi),Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Duration(s),MapQuest,MapQuest-Bing,Unnamed: 22,L,R,Blowing Dust,Blowing Dust / Windy,Blowing Sand,Blowing Snow,Blowing Snow / Windy,Clear,Cloudy,Cloudy / Windy,Drizzle,Drizzle / Windy,Drizzle and Fog,Fair,Fair / Windy,Fog,Fog / Windy,Freezing Rain,...,Squalls,Squalls / Windy,T-Storm,T-Storm / Windy,Thunder,Thunder / Windy,Thunder / Wintry Mix / Windy,Thunder in the Vicinity,Thunderstorm,Thunderstorms and Rain,Thunderstorms and Snow,Tornado,Volcanic Ash,Widespread Dust,Widespread Dust / Windy,Wintry Mix,Wintry Mix / Windy,Day,Night,200.0,201.0,202.0,203.0,206.0,222.0,229.0,236.0,239.0,241.0,244.0,245.0,246.0,247.0,248.0,336.0,339.0,341.0,343.0,351.0,406.0
0,3,0.01,right lane block due accid i 70 eastbound exit...,36.9,10.0,0,0,0,0,0,0,0,0,0,0,0,0,0,67560,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,0.01,accid brice rd tuss rd expect delay,37.9,10.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,0.01,accid oh 32 state rout 32 westbound dela palma...,36.0,10.0,0,0,0,0,0,0,0,0,0,0,0,1,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,0.01,accid i 75 southbound exit 52 52b us expect delay,35.1,9.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2,0.01,accid mcewen rd oh 725 miamisburg centervill r...,36.0,6.0,0,0,0,0,0,0,0,0,0,0,0,1,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,3,0.01,accid i 270 outerbelt northbound near exit 29 ...,37.9,7.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,2,0.0,accid oakridg dr woodward expect delay,34.0,7.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,3,0.01,accid i 75 southbound exit 54b grand expect delay,34.0,7.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,2,0.0,accid notr dame ave warner expect delay,33.3,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,3,0.01,right hand shoulder block due accid i 270 oute...,37.4,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,84600,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [49]:
processed_data_3.shape

(2188680, 163)

I'm going to compute TF-IDF and vary the hyperparameters to get a vocabulary that is appropriately sized. The hyperparameters to consider here are:
- min_df: the minimum number (if an int is passed in) or proportion (if a float) of documents a term must appear in to be placed in the vocabulary
- max_df: the maximum number (if an int is passed in) or proportion (if a float) of documents a term may appear in and still be placed in the vocabulary
  - This is to take out terms like "the", which might be so common that they do not give us any information on the document
- ngram_range: the range of words that are allowed in one term in the vocab
  - For example, "the quick" with an ngram_range of (1, 2) would return "the", "quick", "the quick"
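
A small sketch of how these three hyperparameters shape the vocabulary, using made-up descriptions rather than the real data. The thresholds here (`min_df=2`, `max_df=0.5`) are chosen to make the effect visible on four documents and are not the values used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) keeps single words and adjacent pairs.
ngram_vec = TfidfVectorizer(ngram_range=(1, 2))
ngram_vec.fit(["the quick"])
print(sorted(ngram_vec.vocabulary_))  # ['quick', 'the', 'the quick']

# Four made-up descriptions to show document-frequency filtering.
docs = [
    "accident on main st",
    "accident on oak ave",
    "accident near exit",
    "lane blocked near exit",
]

# min_df=2: keep terms found in at least 2 documents.
# max_df=0.5: drop terms found in more than 50% of documents.
df_vec = TfidfVectorizer(min_df=2, max_df=0.5)
df_vec.fit(docs)
print(sorted(df_vec.vocabulary_))  # ['exit', 'near', 'on']
```

Here "accident" appears in three of four documents and is removed by `max_df`, while words that occur only once are removed by `min_df`.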

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(min_df=.0025, max_df=.1, ngram_range=(1, 4))
tvec.fit(processed_data_3['Description'])
tvec_weights = tvec.transform(processed_data_3['Description'])

In [51]:
type(tvec_weights)

scipy.sparse.csr.csr_matrix

In [52]:
weights = np.asarray(tvec_weights.mean(axis=0)).ravel().tolist()
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out().
weights_df = pd.DataFrame({'term': tvec.get_feature_names_out(), 'weight': weights})
weights_df.sort_values(by='weight', ascending=False).head(20)

Unnamed: 0,term,weight
300,accid us,0.030109
589,near,0.021935
594,northbound exit,0.020724
369,ca,0.020478
680,southbound exit,0.020264
523,hwi,0.020054
560,ln,0.019162
665,shoulder block due,0.017517
666,shoulder block due accid,0.017346
611,pkwi,0.017096


In [53]:
len(tvec.vocabulary_)

747

Now that we have the TF-IDF representation, I'm going to convert the entire dataframe to a sparse scipy matrix and concatenate the TF-IDF features. This is because while the pandas sparse representation does a good job of compressing the data, sklearn actually blows it back up into a dense representation when training models, which is inefficient. Thus, we convert to scipy.
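
The conversion and concatenation described above can be sketched on toy data like this. The two-column frame and the 3x2 text block are stand-ins invented for the example; only the scipy calls mirror what the notebook does.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Toy numeric frame (stand-in for the processed features) and a toy
# TF-IDF-like block with the same number of rows.
features = pd.DataFrame({"Distance(mi)": [0.01, 0.0, 0.01],
                         "Junction": [0, 1, 0]})
text_block = scipy.sparse.csr_matrix(
    np.array([[0.0, 0.7], [0.5, 0.0], [0.0, 0.0]]))

# Convert the frame to CSR, then stack the two blocks column-wise.
feature_block = scipy.sparse.csr_matrix(features.to_numpy(dtype=float))
combined = scipy.sparse.hstack([feature_block, text_block], format="csr")

print(combined.shape)  # (3, 4)
print(combined.nnz)    # 5 stored non-zero entries
```

Passing `format="csr"` is optional; without it `hstack` returns a COO matrix, which can still be saved but does not support row slicing.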

In [54]:
type(tvec_weights)

scipy.sparse.csr.csr_matrix

In [55]:
tvec_weights.shape

(2188680, 747)

In [56]:
tvec_weights[0]

<1x747 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

Now that we have all the information that we need from the description, we can drop the column so that we can convert the dataframe into a numerical matrix.

In [0]:
# drop() returns the new frame; combining assignment with inplace=True would assign None.
processed_data_3 = processed_data_3.drop(['Description'], axis=1)


In [58]:
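# Start_Time and End_Time were already dropped in Part One, so this is a no-op;
# errors='ignore' keeps the cell from raising a KeyError.
processed_data_3 = processed_data_3.drop(
    ['Start_Time', 'End_Time'], axis=1, errors='ignore')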

In [0]:
# np.float was removed from NumPy; the built-in float is equivalent here.
sparse_processed = scipy.sparse.csr_matrix(processed_data_3.values.astype(float))

In [0]:
final_processed = scipy.sparse.hstack([sparse_processed, tvec_weights])

In [0]:
final_processed

Now saving this as a sparse scipy matrix.

In [0]:
scipy.sparse.save_npz(
    "/content/drive/My Drive/CIS 545/Final Project/final_processed.npz",
     final_processed)