***Classifiers and gun violence media coverage between 2013 and 2018 in the United States of America***

This project analyzes media coverage of gun violence incidents over the roughly five years (January 2013 to March 2018) covered by the Gun Violence Data Set. By training supervised classifiers with scikit-learn to recognize whether an article discusses an incident in a Republican or a Democrat state, we explore basic machine learning classifiers.

The primary assumption of the project is that the state in which an incident occurred can serve as a proxy for classifying the source outlet as either Democrat or Republican. This assumption is unverified, and future projects could aim to verify it. However, the majority of incidents in the data set are small-scale, with few or no individuals killed, so there is little reason to expect that most of them received nationwide coverage. Such incidents are more likely to be covered by local or state-level outlets. More ambitious future projects should classify by county and verify that the coverage does indeed come from local outlets.


**Research Question:** Can a supervised machine learning algorithm correctly classify gun violence coverage articles according to whether the incident occurred in a Republican or a Democrat state?

**Sub Research Question:** Which characteristics in the corpora allow the classifier to work?

* Loading all necessary packages.

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
from math import isnan
from itertools import islice
import numpy as np
import datashader as ds 
import datashader.transfer_functions as tf
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, inferno
from datashader.bokeh_ext import InteractiveImage
from functools import partial
from bokeh.models import BoxZoomTool
from bokeh.plotting import figure, output_notebook, show
from pandas import ExcelWriter
from pandas import ExcelFile
import nltk 
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics 
from sklearn.linear_model import LogisticRegression
import string 
import plotly as py 
from string import punctuation
import pyLDAvis
import pyLDAvis.gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import CoherenceModel
from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
import folium  # imported here so the two checks below do not raise a NameError
print(folium.__file__)
print(folium.__version__)

We take a sample of 50k articles and check whether the distribution of incidents per year in the sample is proportional to that in the original data. We create an article_links dict to drive the scraping, and then drop all values that are NaN (of type float), i.e. incidents without a source URL. The scraping code is not shown; it was done with BeautifulSoup. Only the final 50klinks file is kept.

In [None]:
gunviolence = pd.read_csv("../input/gun-violence-data/gun-violence-data_01-2013_03-2018.csv")#gun violence data set 
gunviolence = gunviolence.dropna(subset = ['longitude', 'latitude'])
gunviolence = gunviolence[gunviolence.source_url.str.contains("youtube") == False]
gunviolence['year'] = gunviolence.date.str[:4] # we need this for folium for later 
sample_gun = gunviolence.loc[np.random.permutation(gunviolence.index)[:50000]] #random sample for the scraper
print(gunviolence['year'].value_counts())
print(sample_gun['year'].value_counts())
article_links = sample_gun.set_index('incident_id').to_dict()['source_url'] 
clean_links = {k:v for k,v in article_links.items() if type(v) != float}
tests = {k: clean_links[k] for k in list(clean_links)[:100]} # first 100 links, used to test the scraper
gunviolence.head()
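The proportionality check described above can be made explicit by comparing normalized year shares instead of reading two raw count tables side by side. The following is a minimal sketch on synthetic data; the `full` and `sample` frames and the 0.05 tolerance are assumptions made for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the incident table; only a 'year' column is needed.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "year": rng.choice(["2014", "2015", "2016", "2017"], size=20000,
                       p=[0.2, 0.25, 0.25, 0.3])
})
sample = full.sample(n=5000, random_state=0)

# Share of incidents per year in the full data and in the sample.
shares = pd.DataFrame({
    "full": full["year"].value_counts(normalize=True),
    "sample": sample["year"].value_counts(normalize=True),
}).sort_index()
shares["abs_diff"] = (shares["full"] - shares["sample"]).abs()
print(shares)

# Hypothetical tolerance: flag the sample if any year's share is off by more than 5 points.
is_proportional = bool((shares["abs_diff"] < 0.05).all())
```

With the real data, `full` would be `gunviolence` and `sample` would be `sample_gun`.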

Below we visualize all incidents with datashader. The heatmap is aggregated with `ds.count('n_killed')`, which counts the records with an `n_killed` value in each pixel rather than summing the column, so the color reflects the density of incidents. We can observe very few incidents in the Midwest due to the comparatively low population density. Unsurprisingly, locations such as Chicago, New York, Houston, San Diego and Los Angeles clearly stand out. We can zoom into the map using the zoom-toggle icon (third icon on the right of the map).

In [None]:
output_notebook()

US = x_range, y_range = ((-161.75583 ,-68.01197), (19.50139,64.85694))

plot_width  = int(800)
plot_height = int(plot_width//1.2)

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)
    p.axis.visible = True
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
               
    p.add_tools(BoxZoomTool(match_aspect=True))
               
    return p
    
options = dict(line_color=None, fill_color='blue', size=5)


background = "black"
export = partial(export_image, export_path="export", background=background)
cm = partial(colormap_select, reverse=(background=="black"))

def create_image(x_range, y_range, w=plot_width, h=plot_height):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(gunviolence, 'longitude', 'latitude',  ds.count('n_killed'))
    img = tf.shade(agg, cmap=Hot, how='eq_hist')
    return tf.dynspread(img, threshold=0.5, max_px=4)

p = base_plot(background_fill_color=background)
export(create_image(*US),"US_hot")
InteractiveImage(p, create_image)

A downside of scraping with BeautifulSoup is that many of the downloaded articles also contain unrelated text, such as updated privacy policies. We therefore check which articles are actually usable. Many sites refuse the connection, many no longer exist, and some block access because of the EU GDPR. The scraper was written so that any text found on a page overwrote the source_url stored as the value; if no text was found, the URL stayed in place. A value that still contains "http" and is shorter than 40 words is therefore treated as a leftover link rather than an article. The cleaning process removes 20700 articles, leaving 28629 articles for the analysis.

In [None]:
clean_links = np.load('../input/50klinks/50klinks.npy', allow_pickle=True).item()  # allow_pickle is needed on recent NumPy versions to load a pickled dict
#cleaning 
cleaner_articles = {}
not_so_clean_articles = {}
for key, value in clean_links.items():
    if 'http' in value and \
        len(value.split()) < 40:
        not_so_clean_articles[key] = value
    else:
        cleaner_articles[key] = value
#this step removes 3439 articles - if checked they are just links
#next step of clean-up - specific terms/privacy policies/etc.
useful_articles = {}
trash_articles = {}
for key,value in cleaner_articles.items():
    if len(value.split()) < 10 or \
        "GDPR" in value or \
        "JavaScript" in value or \
        "page you requested is currently unavailable" in value or \
        "is no longer available" in value or \
        "page you requested could not be found" in value or\
        "CAPTCHA" in value or\
        "403" in value:
            trash_articles[key] = value 
    else:
        useful_articles[key] = value

gunviolence['texts']=gunviolence.incident_id.map(useful_articles)#add articles to data set 
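The two filtering passes above can also be read as a single predicate. The sketch below restates the same rules in one function so they can be tried on individual strings; the name `is_useful`, its parameter names and the three example texts are hypothetical, while the marker strings and thresholds are copied from the cell above.

```python
# Marker strings copied from the cleaning cell; any of them marks a page as boilerplate.
BOILERPLATE_MARKERS = (
    "GDPR",
    "JavaScript",
    "page you requested is currently unavailable",
    "is no longer available",
    "page you requested could not be found",
    "CAPTCHA",
    "403",
)

def is_useful(text, min_words=10, link_max_words=40):
    """Return True when scraped text looks like an article, not a leftover URL or error page."""
    n_words = len(text.split())
    if "http" in text and n_words < link_max_words:
        return False  # value was never overwritten: still (mostly) the source URL
    if n_words < min_words:
        return False  # too short to be an article
    return not any(marker in text for marker in BOILERPLATE_MARKERS)

# Hypothetical examples: an un-scraped link, an error page, and a short article.
examples = {
    1: "http://example.com/story",
    2: "Please enable JavaScript to view this page and continue reading the site content today",
    3: "Police said a man was shot on the east side late Friday night and taken to a local hospital",
}
useful = {k: v for k, v in examples.items() if is_useful(v)}
print(list(useful))
```

Note that a plain substring test such as `"403" in text` also matches articles that merely contain that number, for example in a street address, so some genuine articles may be discarded.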

Now we can start the text analysis. For that we import a dataframe with two columns: the state name and its designation as Red (Republican) or Blue (Democrat). Of the 28629 articles available, around 17 thousand are about incidents in Republican states and around 12 thousand are about incidents in Democrat states.

In [None]:
#merging extra df
usefuldf = gunviolence.dropna(subset=['texts'])
redvsblue = pd.read_excel("../input/red-vs-blue-states/Red vs Blue.xlsx")
usefuldf = pd.merge(redvsblue, usefuldf, on='state')
gunviolence = pd.merge(redvsblue,gunviolence, on= 'state')
print(usefuldf['Color'].value_counts())  # print explicitly: a bare expression in the middle of a cell shows nothing
repdf = usefuldf.loc[usefuldf['Color'] == 'Red']
demdf = usefuldf.loc[usefuldf['Color'] == 'Blue']
print(len(repdf))
print(len(demdf))
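Because `pd.merge` defaults to an inner join, any incident whose state name has no match in the Red vs Blue table is silently dropped. A minimal sketch of how this could be checked is shown below; the two miniature tables are hypothetical stand-ins for `gunviolence` and `redvsblue`.

```python
import pandas as pd

# Hypothetical miniature versions of the incident table and the state colour table.
incidents = pd.DataFrame({
    "incident_id": [1, 2, 3, 4],
    "state": ["Texas", "Illinois", "Texas", "District of Columbia"],
})
colors = pd.DataFrame({"state": ["Texas", "Illinois"], "Color": ["Red", "Blue"]})

# A left merge keeps every incident; indicator=True adds a `_merge` column showing
# whether a matching state was found. validate raises if a state is listed twice.
merged = incidents.merge(colors, on="state", how="left",
                         indicator=True, validate="many_to_one")
unmatched_states = sorted(merged.loc[merged["_merge"] == "left_only", "state"].unique())
print(unmatched_states)
print(merged["Color"].value_counts())
```

An empty `unmatched_states` list on the real data would confirm that the inner merge loses no incidents.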

The following step is an initial exploration of the most frequent terms in the coverage: across all articles, in articles about incidents in Democrat states, and in articles about incidents in Republican states. Before running the frequency count we convert all words to lowercase, replace periods with spaces, and define a function that removes stopwords.

In [None]:
#getting lists for frequency analysis 
all_analysis = usefuldf['texts'].tolist()
rep_analysis = repdf['texts'].tolist()
dem_analysis = demdf['texts'].tolist()
all_analysis = [x.lower().replace('.', ' ') for x in all_analysis]  # reassign so the cleaned text is what gets counted below
rep_analysis = [x.lower().replace('.',' ') for x in rep_analysis]
dem_analysis = [x.lower().replace('.', ' ') for x in dem_analysis]
#stopword removal for most frequent word count 
from nltk.corpus import stopwords as nltk_stopwords  # alias keeps the module intact, so re-running this cell does not fail
stopwords = nltk_stopwords.words('english')
stopwords.append("said") # we append this word because the articles so often state that the police or someone else "said" something

This initial observation does not reveal any significant differences in the coverage, apart from clear trends in how gun violence is associated with males and females as likely victims.

In [None]:
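def stopwording(y):
    # Drop stopwords. Punctuation is stripped only for the comparison,
    # so the words that are kept retain any attached punctuation.
    stop_set = set(stopwords)  # set lookup is much faster than scanning the list
    return [' '.join(w for w in line.split() if w.lower().strip(punctuation) not in stop_set) for line in y]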
all_words = stopwording(all_analysis)
rep_words = stopwording(rep_analysis)
dem_words = stopwording(dem_analysis)

all_ready = []
for line in all_words:
    all_ready.extend(line.split())
rep_ready = []
for line in rep_words:
    rep_ready.extend(line.split())
dem_ready = []
for line in dem_words:
    dem_ready.extend(line.split())
def counting(y):
    y_counter = Counter(y)
    y_most_common = y_counter.most_common(50)
    return y_most_common

all_words_count = counting(all_ready)
print(all_words_count)
rep_words_count = counting(rep_ready)
print(rep_words_count)
dem_words_count = counting(dem_ready)
print(dem_words_count) 
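Two top-50 lists are hard to compare by eye, because the most frequent words are largely the same in both corpora. One possible complement, sketched here under the assumption that smoothed relative frequencies are an acceptable measure, is to rank words by the log-ratio of their share in each corpus. The function name `relative_log_ratio` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def relative_log_ratio(tokens_a, tokens_b, top_n=5, smoothing=1.0):
    """Rank words by log(share in A / share in B), using additive smoothing."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(count_a) | set(count_b)
    total_a = sum(count_a.values()) + smoothing * len(vocab)
    total_b = sum(count_b.values()) + smoothing * len(vocab)
    scores = {
        w: math.log(((count_a[w] + smoothing) / total_a)
                    / ((count_b[w] + smoothing) / total_b))
        for w in vocab
    }
    # Highest scores are the words most characteristic of corpus A.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy token lists standing in for rep_ready and dem_ready.
rep_tokens = "police shooting dallas dallas county sheriff police".split()
dem_tokens = "police shooting chicago chicago police ward".split()
print(relative_log_ratio(rep_tokens, dem_tokens, top_n=3))
```

Swapping the two arguments gives the words most characteristic of the other corpus.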

The next step is to train a supervised machine learning model to see whether, and how successfully, it can recognize the differences between coverage of incidents in Republican and Democrat states. For this we use scikit-learn. First we create three data sets for training, validation and testing, at a ratio of 60:20:20. We pre-process these by lowercasing all words, stripping leading and trailing punctuation from each article, and removing stopwords. Stemming is not done here but is recommended. We then create a list of tuples for each set, with the text as the first element and the party color (Red or Blue) as the second.

In [None]:
train, validate, test = np.split(usefuldf.sample(frac=1), [int(.6*len(usefuldf)), int(.8*len(usefuldf))])
def cleaning(y):
    # Lowercase each article and strip punctuation from the ends of the whole string,
    # then drop stopwords word by word.
    y_text_list = [line.lower().strip(punctuation) for line in y['texts'].tolist()]
    return [' '.join(w for w in line.split() if w not in stopwords) for line in y_text_list]
train_text_list = cleaning(train)
train_color_list = train['Color'].tolist()
train = list(zip(train_text_list, train_color_list))
validate_text_list = cleaning(validate)
validate_color_list = validate['Color'].tolist()
validate = list(zip(validate_text_list, validate_color_list))
test_text_list = cleaning(test)
test_color_list = test['Color'].tolist()
test = list(zip(test_text_list, test_color_list))
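The split above shuffles the rows but does not fix a random seed or balance the labels, so results can vary between runs. As an alternative sketch (not the method used in this notebook), the same 60:20:20 ratio can be obtained with two calls to `train_test_split`, stratified on the label. The toy frame `df` and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same two columns the classifiers use.
df = pd.DataFrame({
    "texts": [f"article {i}" for i in range(100)],
    "Color": ["Red"] * 60 + ["Blue"] * 40,
})

# First carve off 20% for testing, then split the remaining 80% so that the
# validation part is 20% of the whole (0.25 * 80% = 20%).
# `stratify` keeps the Red/Blue ratio roughly equal in every part.
rest, test_df = train_test_split(df, test_size=0.2, stratify=df["Color"], random_state=42)
train_df, validate_df = train_test_split(rest, test_size=0.25, stratify=rest["Color"],
                                         random_state=42)

print(len(train_df), len(validate_df), len(test_df))
```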

The first model is a Naive Bayes classifier trained on count-vectorized text, i.e. on word frequencies per document. On the validation set it returns an accuracy of .91, with a precision of .88 and a recall of .93 for label Blue, and a precision of .95 and a recall of .91 for label Red. The average f1-score is .92. These are already good results: the model correctly classifies an article as Red or Blue in the large majority of cases. We do not yet know why this works so well, so we first run some more models.

In [None]:
# Naive Bayes classic 
vectorizer = CountVectorizer(stop_words = 'english')
train_features = vectorizer.fit_transform(r[0] for r in train)
validate_features = vectorizer.transform(r[0] for r in validate)
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
predictions = nb.predict(validate_features)
actual =[r[1] for r in validate]
print(metrics.accuracy_score(actual,predictions,normalize = True))
print(classification_report(actual, predictions))
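The per-label precision and recall in the classification report come directly from the confusion matrix. A small worked example with made-up labels (the two `toy_` lists are hypothetical and unrelated to the real validation set) shows how the figures for label Blue are derived.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for ten validation articles.
toy_actual = ["Blue", "Blue", "Blue", "Blue", "Red", "Red", "Red", "Red", "Red", "Red"]
toy_predictions = ["Blue", "Blue", "Blue", "Red", "Red", "Red", "Red", "Red", "Red", "Blue"]

# Rows are actual labels, columns are predicted labels, both in the order given by `labels`.
conf_mat = confusion_matrix(toy_actual, toy_predictions, labels=["Blue", "Red"])
print(conf_mat)

# Precision for Blue: of the articles predicted Blue, the share that really are Blue.
# Recall for Blue: of the articles that really are Blue, the share that were found.
blue_precision = precision_score(toy_actual, toy_predictions, pos_label="Blue")
blue_recall = recall_score(toy_actual, toy_predictions, pos_label="Blue")
blue_f1 = f1_score(toy_actual, toy_predictions, pos_label="Blue")
print(blue_precision, blue_recall, blue_f1)
```

Here three of the four articles predicted Blue are correct, and three of the four actual Blue articles are found, so precision and recall for Blue are both 3/4.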

The second model is a logistic regression, also based on a count vectorizer. It returns an accuracy of .95, with a precision of .95 and a recall of .93 for label Blue, and a precision of .95 and a recall of .97 for label Red. The average f1-score is .95.

In [None]:
#Logistic regression classic 
vectorizer = CountVectorizer(stop_words = 'english')
train_features = vectorizer.fit_transform(r[0] for r in train)
validate_features = vectorizer.transform(r[0] for r in validate )
logreg = LogisticRegression()
logreg.fit(train_features, [r[1] for r in train])
predictions = logreg.predict(validate_features)
actual =[r[1] for r in validate]
print(metrics.accuracy_score(actual,predictions,normalize = True))
print(classification_report(actual, predictions))

Here we train a Naive Bayes classifier on TF-IDF features (term frequency weighted by inverse document frequency), using TfidfVectorizer in scikit-learn. This model returns an accuracy of .89, with a precision of .99 and a recall of .75 for label Blue, and a precision of .85 and a recall of .99 for label Red. The average f1-score is .89. Our Naive Bayes classifier did better with plain counts than with TF-IDF weighting.

In [None]:
#Naive Bayes tfidf 
vectorizer = TfidfVectorizer(stop_words = 'english')
train_features = vectorizer.fit_transform(r[0] for r in train)
validate_features = vectorizer.transform(r[0] for r in validate )
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
predictions = nb.predict(validate_features)
actual =[r[1] for r in validate]
print(metrics.accuracy_score(actual,predictions,normalize = True))
print(classification_report(actual, predictions))
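To illustrate what changes between the two vectorizers, the sketch below fits both on three toy documents (hypothetical text, not taken from the corpus) and compares the weights of a word that occurs in every document with one that occurs in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three toy documents: "police" appears in all of them, "chicago" in only one.
docs = ["police chicago shooting", "police shooting", "police arrest"]

counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

# Feature vectors of the first document under each scheme.
count_row = counts.transform(docs[:1]).toarray()[0]
tfidf_row = tfidf.transform(docs[:1]).toarray()[0]

for word in ("police", "chicago"):
    print(word,
          count_row[counts.vocabulary_[word]],
          round(tfidf_row[tfidf.vocabulary_[word]], 3))
```

With raw counts both words get the same value in the first document. With TF-IDF the rarer word "chicago" is weighted more heavily than "police", which appears in every document.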

Finally, we run a logistic regression model on TF-IDF features. This gives an accuracy of .93, with a precision of .96 and a recall of .88 for label Blue, and a precision of .92 and a recall of .98 for label Red. The average f1-score is .93.

In [None]:
#Logistic regression tfidf
vectorizer = TfidfVectorizer(stop_words = 'english')
train_features = vectorizer.fit_transform(r[0] for r in train)
validate_features = vectorizer.transform(r[0] for r in validate )
logreg = LogisticRegression()
logreg.fit(train_features, [r[1] for r in train])
predictions = logreg.predict(validate_features)
actual =[r[1] for r in validate]
print(metrics.accuracy_score(actual,predictions,normalize = True))
print(classification_report(actual, predictions))

The best scores come from the logistic regression model on plain counts (without TF-IDF weighting), with an accuracy of .95, an average f1-score of .95, and precision and recall above .9 for both labels. We now run this model on the test data and confirm that it does indeed classify well. However, we do not yet understand why. The next step is to examine the topics in the coverage of incidents in Democrat and Republican states using LDA, to see whether we can determine what makes the classifier work.

In [None]:
vectorizer = CountVectorizer(stop_words = 'english')
train_features = vectorizer.fit_transform(r[0] for r in train)
test_features = vectorizer.transform(r[0] for r in test)
logreg = LogisticRegression()
logreg.fit(train_features, [r[1] for r in train])
predictions = logreg.predict(test_features)
actual =[r[1] for r in test]
print(metrics.accuracy_score(actual,predictions,normalize = True))
print(classification_report(actual, predictions))

Before we start our LDA we need to do some preprocessing, like removing stopwords, punctuation etc. For the LDA model we use the package gensim. 

In [None]:
rep_lda = usefuldf.loc[usefuldf['Color'] == 'Red', 'texts'].tolist()
dem_lda = usefuldf.loc[usefuldf['Color'] == 'Blue', 'texts'].tolist()
dem_texts_lda = []
for lines in dem_lda:
    dem_texts_lda.append(lines.split())
rep_texts_lda = []
for lines in rep_lda:
    rep_texts_lda.append(lines.split())

In [None]:
def processinglda(y):
    y = [[word.lower().strip(punctuation) for word in lines] for lines in y]       
    y = [[word for word in lines if word !=''] for lines in y]
    y = [[word for word in lines if word not in stopwords] for lines in y]
    return y 
rep_processed = processinglda(rep_texts_lda)
dem_processed = processinglda(dem_texts_lda)

In [None]:
#democrat state topic modeling
id2word = corpora.Dictionary(dem_processed)
mm = [id2word.doc2bow(word) for word in dem_processed]
lda_dem = models.ldamodel.LdaModel(corpus = mm, id2word = id2word, num_topics = 30, alpha = "auto")

The number of topics was chosen using the u_mass topic coherence measure, which gives -2.61 for 30 topics on the Democrat data set. While this score is not ideal, both increasing and decreasing the number of topics resulted in a lower coherence score. Topic modeling on TF-IDF-weighted input produced coherence scores of around -10, depending on the number of topics. Hence, plain word counts are recommended for this data set.

In [None]:
cm1 = models.CoherenceModel(model=lda_dem, corpus= mm, dictionary= id2word, coherence='u_mass')  
mm_model = cm1.get_coherence()
print(mm_model)
print(lda_dem.print_topics())
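To give a feel for what a u_mass score measures, here is a small pure-Python sketch of the idea: for each pair of a topic's top words, take the log of how often they co-occur in a document relative to how often the earlier word occurs at all. This is an illustrative version with +1 smoothing and is not gensim's implementation, which uses its own smoothing constant and aggregates over all topics. The function name and the toy documents are hypothetical.

```python
import math

def umass_coherence(top_words, documents):
    """Mean pairwise UMass-style score for one topic's top words (illustrative, +1 smoothing)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            # Assumes every top word occurs in at least one document.
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Toy tokenized documents.
toy_docs = [["police", "shooting", "chicago"],
            ["police", "shooting", "victim"],
            ["police", "chicago"],
            ["weather", "sunny"]]

coherent = umass_coherence(["police", "shooting", "chicago"], toy_docs)
mixed = umass_coherence(["police", "weather", "sunny"], toy_docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words that rarely co-occur, which is why a higher (less negative) value is read as a more coherent topic.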

An examination of the topics for incidents in Democrat states gives some insight into why our classifier may work as well as it does: the articles mention locations. Among the most prominent topics we find frequent mentions of Chicago, Bedford-Stuyvesant, Spokane, Seattle, Gresham, Lawndale, Englewood, Harlem, Belmont, and other cities or boroughs in primarily Democrat states. However, the presence of cities like Chesterfield, Austin and several other Republican strongholds calls this explanation into question. Regardless, the presence of Democrat strongholds is much more evident.

In [None]:
#republican state topic modeling 
id2word = corpora.Dictionary(rep_processed)
mm = [id2word.doc2bow(word) for word in rep_processed]
lda_rep = models.ldamodel.LdaModel(corpus = mm, id2word = id2word, num_topics = 40, alpha = "auto")

The number of topics was chosen using the u_mass topic coherence measure, which gives -2.47 for 30 topics. While this score is not ideal, both increasing and decreasing the number of topics resulted in a lower coherence score. Topic modeling on TF-IDF-weighted input produced coherence scores of around -12, depending on the number of topics. Hence, plain word counts are recommended for this data set.

In [None]:
cm1 = models.CoherenceModel(model=lda_rep, corpus= mm, dictionary= id2word, coherence='u_mass')  
mm_model_rep = cm1.get_coherence()
print(mm_model_rep)
print(lda_rep.print_topics())

Indeed, the topics in Republican state coverage suggest that many of the articles refer to locations in Republican or Republican-leaning (in the 2016 election) states, such as Dallas, Bexar and Philadelphia (Pennsylvania voted Republican in the 2016 presidential election), as well as Greenville, Charleston, Milwaukee and Clayton. No references to locations in Democrat states were found.

Overall the analysis using the LDA approach gives some evidence for concluding that the unexpected accuracy of both the Naive Bayes and logistic regression classifiers is largely due to the reference to locations in each corpus. No further substantially distinctive features were found between the topics in the two corpora that could justify the accuracy of the classifier. 
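The location explanation could be tested more directly by removing place names before vectorizing and checking how far validation accuracy drops. The following is only a sketch of that idea on toy data constructed so that place names carry all of the signal. The `place_names` list is a hand-written, hypothetical stand-in; a real check would need a gazetteer or named-entity recognition.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data in which place names are the only signal; real articles would come
# from the train and validate sets.
train_texts = ["police shooting chicago", "man shot chicago", "victim chicago hospital",
               "police shooting dallas", "man shot dallas", "victim dallas hospital"]
train_labels = ["Blue", "Blue", "Blue", "Red", "Red", "Red"]
val_texts = ["police victim chicago", "man hospital dallas"]
val_labels = ["Blue", "Red"]

# Hypothetical, hand-written place list.
place_names = ["chicago", "dallas"]

def validation_accuracy(stop_words):
    """Fit on the toy training set, ignoring `stop_words`, and score on the toy validation set."""
    vec = CountVectorizer(stop_words=stop_words)
    clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)
    return clf.score(vec.transform(val_texts), val_labels)

acc_with_places = validation_accuracy(None)
acc_without_places = validation_accuracy(place_names)
print(acc_with_places, acc_without_places)
```

On the real corpus, a large drop in accuracy after masking place names would support the conclusion above, while a small drop would suggest that other features also separate the two corpora.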