# Sentiment Analysis of A14 Road Users' Comments

## Introduction

Users of the [A14 road](https://highwaysengland.co.uk/A14-Cambridge-to-Huntingdon-Improvement-Scheme-home) can report incidents using an application. The goal of this challenge, proposed by the organizers of the [Project:Hack5 hackathon](https://projectdataanalytics.uk/eventer/projecthack-5) and [Highways England](https://highwaysengland.co.uk/), is to perform sentiment analysis on the users' comments to obtain new insights and improve the user experience of the application.

## Variables Description

* ID: identification number of the user reporting an incident on the A14 road
* PracticeType: Nature of the report as entered by the user: Hazard or Good Practice Observation
* IncidentType: Type of event reported
* HSWorEnv: Nature of the incident: Health and Safety or Environment
* Section: Section of the A14 road where the incident occurred
* Location: Precise location on the section
* ObservationDateTime: Date and time at which the user completed the form
* Summary: Short summary of the report entered by the user
* Description: Full description of the event by the user
* ActionTaken: Action taken by Highways England
* FatalCategory: Category of the incident
* HEObservationCategory: Category of the Health and Safety Observation
* HEReportingType: Health and Safety reporting

## Initialization

In [None]:
# load libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import matplotlib.pyplot as plt # plotting the data
import seaborn as sns # plotting the data
import math # calculation

from textblob import TextBlob # text analysis (sentiment scores)
import nltk
from wordcloud import WordCloud # to generate word clouds

In [None]:
# Load the dataset
df = pd.read_csv('/kaggle/input/Observations2.csv', engine = "python")

In [None]:
df.info()

In [None]:
# Drop the unnecessary variables or the variables not entered by the user in the application
df.drop(['ID', 'HSWorEnv', 'Location', 'FatalCategory', 'HEObservationCategory'], axis=1, inplace=True)

## Add scores column

In [None]:
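# Polarity (-1 to 1) and subjectivity (0 to 1) scores from TextBlob
def extract_sentiment_polarity(text):
    try:
        return TextBlob(text).sentiment.polarity
    except TypeError:  # missing values (NaN) are not strings
        return None

def extract_sentiment_subjectivity(text):
    try:
        return TextBlob(text).sentiment.subjectivity
    except TypeError:  # missing values (NaN) are not strings
        return None
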
df["Summary_polarity"] = df["Summary"].apply(extract_sentiment_polarity)
df["Summary_subjectivity"] = df["Summary"].apply(extract_sentiment_subjectivity)

df["Description_polarity"] = df["Description"].apply(extract_sentiment_polarity)
df["Description_subjectivity"] = df["Description"].apply(extract_sentiment_subjectivity)

df["ActionTaken_polarity"] = df["ActionTaken"].apply(extract_sentiment_polarity)
df["ActionTaken_subjectivity"] = df["ActionTaken"].apply(extract_sentiment_subjectivity)
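
TextBlob polarity lies between -1 (negative) and 1 (positive), and subjectivity between 0 (objective) and 1 (subjective). To make the scores easier to read, they could be binned into labels. The following is only a sketch: the scores are hypothetical, and the ±0.1 cut-offs and the `Summary_sentiment` column name are assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical polarity scores, in the range TextBlob returns (-1 to 1).
scores = pd.DataFrame({"Summary_polarity": [-0.6, -0.05, 0.0, 0.3, None]})

# Assumed cut-offs: scores within +/-0.1 of zero are treated as neutral.
bins = [-1.0, -0.1, 0.1, 1.0]
labels = ["negative", "neutral", "positive"]

# include_lowest=True keeps a score of exactly -1.0 in the first bin;
# missing scores stay missing.
scores["Summary_sentiment"] = pd.cut(
    scores["Summary_polarity"], bins=bins, labels=labels, include_lowest=True
)
print(scores)
```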

## Visualization

In [None]:
# Set up visualization colors

# Set up color blind friendly color palette
# The palette with grey:
cbPalette = ["#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
# The palette with black:
cbbPalette = ["#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]

# sns.palplot(sns.color_palette(cbPalette))
# sns.palplot(sns.color_palette(cbbPalette))

sns.set_palette(cbPalette)
#sns.set_palette(cbbPalette)

### Categories count

In [None]:
title = 'Practice Type Count'
sns.countplot(y = df['PracticeType'])
plt.title(title)
plt.ioff()

In [None]:
title = 'Incident Type Count'
f, ax = plt.subplots(figsize=(10, 10))
sns.countplot(y = df['IncidentType'])
plt.title(title)
plt.ioff()

In [None]:
title = 'Section Count'
sns.countplot(y = df['Section'])
plt.title(title)
plt.ioff()

In [None]:
title = 'HEReportingType Count'
sns.countplot(y = df['HEReportingType'])
plt.title(title)
plt.ioff()

### PracticeType according to sentiment

In [None]:
title = 'Median Summary Polarity per Practice Type'
result = df.groupby(["PracticeType"])['Summary_polarity'].aggregate(np.median).reset_index().sort_values('Summary_polarity')
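# estimator=np.median so the bars match the title (barplot draws the mean by default)
sns.barplot(x='PracticeType', y="Summary_polarity", data=df, estimator=np.median, order=result['PracticeType'])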
plt.title(title)
plt.ioff()
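
seaborn's `barplot` draws the mean of each group unless an `estimator` is passed, so it can help to tabulate the mean and the median side by side before reading the charts. A minimal sketch with hypothetical scores (the `toy` frame stands in for `df`):

```python
import pandas as pd

# Hypothetical scores; the notebook itself uses df["Summary_polarity"].
toy = pd.DataFrame({
    "PracticeType": ["Hazard", "Hazard", "Hazard", "Good Practice", "Good Practice"],
    "Summary_polarity": [-0.8, 0.0, 0.1, 0.2, 0.6],
})

# One row per practice type, with both statistics as columns.
summary = toy.groupby("PracticeType")["Summary_polarity"].agg(["mean", "median"])
print(summary)
```

In this toy data one strongly negative comment pulls the Hazard mean below zero while the median stays at zero, which is why the choice of statistic matters for skewed scores.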

In [None]:
title = 'Median Summary Subjectivity per Practice Type'
result = df.groupby(['PracticeType'])['Summary_subjectivity'].aggregate(np.median).reset_index().sort_values('Summary_subjectivity')
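# estimator=np.median so the bars match the title (barplot draws the mean by default)
sns.barplot(x='PracticeType', y='Summary_subjectivity', data=df, estimator=np.median, order=result['PracticeType'])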
plt.title(title)
plt.ioff()

In [None]:
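title = 'Median Description Polarity per Practice Type'
result = df.groupby(['PracticeType'])['Description_polarity'].aggregate(np.median).reset_index().sort_values('Description_polarity')
# estimator=np.median so the bars show the median (barplot draws the mean by default)
sns.barplot(x='PracticeType', y='Description_polarity', data=df, estimator=np.median, order=result['PracticeType'])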
plt.title(title)
plt.ioff()

In [None]:
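title = 'Median Description Subjectivity per Practice Type'
result = df.groupby(['PracticeType'])['Description_subjectivity'].aggregate(np.median).reset_index().sort_values('Description_subjectivity')
# estimator=np.median so the bars show the median (barplot draws the mean by default)
sns.barplot(x='PracticeType', y='Description_subjectivity', data=df, estimator=np.median, order=result['PracticeType'])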
plt.title(title)
plt.ioff()

### Sentiment per section

In [None]:
# see https://www.kaggle.com/nidaguler/eda-and-data-visualization-ny-airbnb
title = 'Median Sentiment Polarity according to Section'
result = df.groupby(['Section'])['Summary_polarity'].aggregate(np.median).reset_index().sort_values('Summary_polarity')
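# estimator=np.median so the bars match the title and the median-based ordering
sns.barplot(x='Section', y='Summary_polarity', data=df, estimator=np.median, order=result['Section'])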
plt.title(title)
plt.ioff()

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
title = 'Median Sentiment Polarity according to Incident Type'
result = df.groupby(['IncidentType'])['Summary_polarity'].aggregate(np.median).reset_index().sort_values('Summary_polarity')
# estimator=np.median so the bars match the median-based ordering
sns.barplot(y='IncidentType', x='Summary_polarity', data=df, estimator=np.median, order=result['IncidentType'])
plt.title(title)
plt.ioff()

### Association between categorical variables

In [None]:
pd.crosstab(index=df['PracticeType'], columns=df['IncidentType'])

In [None]:
contingency_table = pd.crosstab(index=df['PracticeType'], 
                          columns=df['Section'])
contingency_table

In [None]:
contingency_table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

In [None]:
contingency_table = pd.crosstab(index=df['PracticeType'], 
                          columns=df['HEReportingType'])
contingency_table

In [None]:
contingency_table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)
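
Raw counts are hard to compare when one practice type has many more reports than the other. One option is to normalize each row of the contingency table so it shows shares instead of counts. A minimal sketch, with hypothetical section names:

```python
import pandas as pd

# Hypothetical reports; the section names are placeholders.
toy = pd.DataFrame({
    "PracticeType": ["Hazard", "Hazard", "Hazard", "Good Practice"],
    "Section": ["Section 1", "Section 1", "Section 2", "Section 2"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
row_shares = pd.crosstab(index=toy["PracticeType"],
                         columns=toy["Section"],
                         normalize="index")
print(row_shares)
```

The resulting table can be passed to `.plot(kind="bar", stacked=True)` in the same way as the count table.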

### Wordcloud analysis

In [None]:
# Separate the data into Hazard and Good Practice datasets
df_Hazard = df.loc[(df['PracticeType'] == 'Hazard')]
df_Good = df.loc[(df['PracticeType'] == 'Good Practice')]

In [None]:
# See https://stackoverflow.com/questions/33279940/how-to-combine-multiple-rows-of-strings-into-one-using-pandas
text_hazard = df_Hazard.Summary.str.cat(sep=', ') # Concatenate the text of all rows of the Summary column
# len() on a string counts characters, not words
print("There are {} characters in the combined Hazard summaries.".format(len(text_hazard)))
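
Note that `len()` applied to a string counts characters, not words. A minimal sketch of the difference, using two hypothetical summaries in place of the real column:

```python
# Hypothetical summaries standing in for df_Hazard.Summary.
summaries = ["Loose cable on walkway", "Barrier left open"]

text = ", ".join(summaries)   # same separator as str.cat(sep=', ')
n_chars = len(text)           # characters, including spaces and commas
n_words = len(text.split())   # whitespace-separated tokens

print("{} characters, {} words".format(n_chars, n_words))
```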

In [None]:
# https://www.datacamp.com/community/tutorials/wordcloud-python
# Create stopword list:
stopwords = (['for', 'in', 'the', 'and', 'on', 'site', 'to', 'or', 'with', 'from', 'A14', 'when', 'there', 'is'])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text_hazard)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
text_good = df_Good.Summary.str.cat(sep=', ') # Concatenate the text of all rows of the Summary column
# len() on a string counts characters, not words
print("There are {} characters in the combined Good Practice summaries.".format(len(text_good)))

In [None]:
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text_good)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Modeling the Practice Type

The goal is to support autocompletion of the form for the application users. If we can prefill some entries, users are more likely to complete the whole form.

In [None]:
df.drop(['ObservationDateTime', 'Summary', 'Description', 'ActionTaken', 'HEReportingType'], axis=1, inplace=True)

### Data Encoding

In [None]:
# Encoding categorical data
# See https://pbpython.com/categorical-encoding.html
#data = pd.get_dummies(data, columns=['IncidentType', 'Section', 'HEReportingType'], drop_first=True)
df = pd.get_dummies(df, columns=['IncidentType', 'Section'], drop_first=True)
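
To see what `get_dummies` with `drop_first=True` does, here is a minimal sketch on a hypothetical three-row frame (the section names are placeholders):

```python
import pandas as pd

# Hypothetical rows; only the column names match the notebook.
toy = pd.DataFrame({
    "PracticeType": ["Hazard", "Good Practice", "Hazard"],
    "Section": ["Section 1", "Section 2", "Section 3"],
})

# One indicator column per category, minus the first one, which becomes
# the implicit reference level (all indicators equal to zero).
encoded = pd.get_dummies(toy, columns=["Section"], drop_first=True)
print(encoded.columns.tolist())
```

The first category is dropped because it is fully determined by the other indicators, so a row with zeros everywhere means "Section 1" here.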

### Deal with missing values

In [None]:
# Keep only the rows with a known PracticeType
df = df.loc[(df['PracticeType'] == 'Hazard') | (df['PracticeType'] == 'Good Practice')]

In [None]:
# Extract the target variable
y = df['PracticeType'].values

In [None]:
# https://machinelearningmastery.com/handle-missing-data-python/
from sklearn.impute import SimpleImputer
values = df.drop('PracticeType', axis=1).values
imputer = SimpleImputer()
transformed_values = imputer.fit_transform(values)
# count the number of NaN values in each column
print(np.isnan(transformed_values).sum())

In [None]:
X = transformed_values
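
With no arguments, `SimpleImputer` replaces each missing value with the mean of its column. A minimal sketch on a hypothetical array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value per column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

imputer = SimpleImputer()            # default strategy="mean"
filled = imputer.fit_transform(toy)  # NaNs replaced by column means (2.0 and 6.0)
print(filled)
```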

### Split the dataset

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

### Fit the model: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(random_state=0)  # fixed seed so the accuracy is reproducible
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

accuracy_score(y_test, y_predict)
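
An accuracy score is easier to interpret next to a simple baseline, such as always predicting the most frequent class. The sketch below uses hypothetical labels and does not reflect the actual class balance of the dataset. In the notebook, `y_test` would take the place of `y_test_toy`.

```python
from collections import Counter

# Hypothetical test labels; in the notebook this would be y_test.
y_test_toy = ["Hazard"] * 8 + ["Good Practice"] * 2

# Accuracy obtained by always predicting the most frequent label.
majority_label, majority_count = Counter(y_test_toy).most_common(1)[0]
baseline_accuracy = majority_count / len(y_test_toy)
print(majority_label, baseline_accuracy)
```

If the model's accuracy is close to this baseline, the features add little beyond the class frequencies.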

## Conclusion

Using a Random Forest model, we can predict the nature of a user observation with 84% accuracy. We could then use this model to prefill this field on the form completed by A14 users. This approach would reduce the amount of data users have to enter and would likely increase the form completion rate.

## References
### Visualization
* https://www.analyticsvidhya.com/blog/2019/09/comprehensive-data-visualization-guide-seaborn-python/
* https://elitedatascience.com/python-seaborn-tutorial

### Categorical Variables
* https://dzone.com/articles/correlation-between-categorical-and-continuous-var-1
* https://adataanalyst.com/data-analysis-resources/visualise-categorical-variables-in-python/

### Natural Language Processing
* https://textblob.readthedocs.io/en/dev/quickstart.html
* https://planspace.org/20150607-textblob_sentiment/

### Machine Learning
* https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd
* https://stackoverflow.com/questions/3172509/numpy-convert-categorical-string-arrays-to-an-integer-array

### Wordcloud analysis
* https://www.datacamp.com/community/tutorials/wordcloud-python
* https://www.kaggle.com/zynicide/wine-reviews/kernels

### Modeling
* https://ehackz.com/2018/03/23/python-scikit-learn-random-forest-classifier-tutorial/
* https://scikit-learn.org/stable/modules/ensemble.html

### Missing values
* https://machinelearningmastery.com/handle-missing-data-python/