# Twitter Fingers == Trigger Fingers: A Look at Gun Violence

## Data Acquisition and Cleaning Part 1:
Data was gathered from one main source, gunviolencearchive.org. Each reported incident of gun violence comes with sources that verify it happened, as well as the number injured, the number killed, and the date, time, and place. The first data acquisition step takes these variables, puts them into a dataframe and then into a csv file, and also finds the news source for each incident and cleans the links so that they are easily accessible.

In [6]:
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup
import re
import numpy as np
import socket
import scipy as sc
import datetime
import matplotlib
import matplotlib.pyplot as plt
import statistics

**Functions**

In [7]:
def transform_tables(pd_dataframe):
    # omit the hyperlink column that will be read as NA values
    new_dataframe = pd_dataframe.loc[:,"Incident ID":"# Injured"]
    # rename columns
    new_dataframe = new_dataframe.rename(columns = {"Incident ID": "ID", "Incident Date": "Date", 
                                    "State": "State", "City Or County": "City/County", 
                                    "Address": "Address", "# Killed": "Killed", 
                                    "# Injured": "Injured"})
    return new_dataframe

In [8]:
def save_html(url, path):
    response = requests.get(url)
    with open(path, "wb") as file:
        file.write(response.content)

In [9]:
def get_webpages(soup, year):
    last_webpage_href = soup.find('a', attrs={'title': "Go to last page"})
    last_webpage_path = last_webpage_href.get('href')
    number_of_other_pages = int(re.findall(r'%s(\d+)'%"page=", last_webpage_path)[0])
    if year in range(2014, 2016):
        webpage_paths = ['/reports/mass-shootings/'+ str(year)] # initialize with the first page's path
    else:
        webpage_paths = ['/reports/mass-shooting?year='+ str(year)]
    for page_number in range(1, number_of_other_pages + 1):
        # replace only the page parameter; substituting the bare number could also alter the year in the path
        path = re.sub(r'page=\d+', 'page=' + str(page_number), last_webpage_path)
        webpage_paths.append(path)
    return webpage_paths

In [10]:
def get_news_sources(soup):
    news_hrefs = soup.find_all('a', attrs={'href': re.compile("^https://|^http://")})
    news_links = [tag.get('href') for tag in news_hrefs if tag.text == "View Source"] # get all sources listed on a page
    return news_links

In [11]:
def remove_nesting(nested_list):
    return [i for j in nested_list for i in j]

**Acquisition and Cleaning**

In [12]:
# get all page paths (requires the first-page HTML files written by the "save first pages html" cell)

web_pages_paths = []
for year in range(2014, 2020):
    path = "mass_shooting_html_"+ str(year)
    with open(path, 'r') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
    web_pages_paths.append(get_webpages(soup, year)) # including the first

In [13]:
# save first pages html

for year in range(2014, 2020):
    if year in range(2014, 2016):
        first_page_url = "https://www.gunviolencearchive.org/reports/mass-shootings/" + str(year)
    else:
        first_page_url = "https://www.gunviolencearchive.org/reports/mass-shooting?year=" + str(year)
    path = "mass_shooting_html_"+ str(year) 
    save_html(first_page_url, path)

In [14]:
sources_container = []  # one entry per year, each a list of per-page source lists
for year_index, year in enumerate(range(2014, 2020)):
    sources_container.append([])
    for page_index, path in enumerate(web_pages_paths[year_index]):
        link = "https://www.gunviolencearchive.org" + path
        filename = "mass_shooting_html_"+ str(year) + "_page_" + str(page_index)
        #save_html(link, filename)  # uncomment on the first run to download each page
        with open(filename, 'r') as html_file:
            soup = BeautifulSoup(html_file, 'html.parser')
        this_page_sources = get_news_sources(soup)
        sources_container[year_index].append(this_page_sources)

sources_container[5][8]


In [None]:
news_2014 = remove_nesting(sources_container[0])
news_2015 = remove_nesting(sources_container[1])
news_2016 = remove_nesting(sources_container[2])
news_2017 = remove_nesting(sources_container[3])
news_2018 = remove_nesting(sources_container[4])
news_2019 = remove_nesting(sources_container[5])
news_2019[:5]

In [None]:
# get the report tables

annual_reports = []
for year in range(2014, 2020):
    csv_file = str(year) + "_mass_shootings.csv"
    this_year_report = pd.read_csv(csv_file)
    cleaned_report = transform_tables(this_year_report)
    annual_reports.append(cleaned_report)
    

In [None]:
ms_2014 = annual_reports[0]
ms_2015 = annual_reports[1]
ms_2016 = annual_reports[2]
ms_2017 = annual_reports[3]
ms_2018 = annual_reports[4]
ms_2019 = annual_reports[5]

ms_2019.head()

In [None]:
ms_2014['Source'] = news_2014
ms_2015['Source'] = news_2015
ms_2016['Source'] = news_2016
try:
    ms_2017['Source'] = news_2017
except Exception as e:
    pass
ms_2018['Source'] = news_2018

# ms_2019['Source'] = news_2019 
# gives error since one row does not have a source listed directly

index = news_2019.index("https://www.wcvb.com/article/6-people-shot-outside-of-roxbury-party-police-say/28306883") # index of where it is supposed to be 
news_2019.insert(index, "https://fox2now.com/2019/07/07/north-county-residents-on-edge-after-5-adults-found-dead-in-apartment/")
news_2019 = [news_2019[i] for i in range(len(news_2019)) if i == 0 or news_2019[i] != news_2019[i-1]]
# drops consecutive duplicates in case the insert above is re-run; the first entry is always kept

ms_2019['Source'] = news_2019 
ms_2019[:10]

In [None]:
merged_data = pd.concat([ms_2014, ms_2015, ms_2016, ms_2017, ms_2018, ms_2019])
print(len(merged_data))
merged_data.to_csv(path_or_buf = "complete_project_dataset") # export as csv file

## Data Acquisition Part 2:
The second step of the data acquisition process was to access the sources and extract the article text from each one. This allows us to create a second set of data, "articletext", for analysis of all the words from these articles.

In [None]:
#reads in complete project data
gunviolencedataset = pd.read_csv("complete_project_dataset")
sourceurl = gunviolencedataset["Source"] #creates a series of just the source urls

The function below takes the source URLs, opens each URL, and reads in the text from that site. It then stores the text found in the "p" tags in a dictionary keyed by URL, so that it can be accessed afterwards.

In [None]:
def getsourcetext(urlseries):
    timeout = 20 # seconds to wait before giving up on a site
    socket.setdefaulttimeout(timeout) # applies the timeout to every new socket
    dictionary = {}
    for k in urlseries: # runs this loop for every url in the series
        try:
            html = urllib.request.urlopen(url=k) # opens the website at k
            html = html.read() # reads in the raw page bytes
            htmlfile = html.decode('utf-8') # decodes the bytes into a string
        # any failure (404/403, timeout, decoding error) skips this url
        except Exception as e:
            continue
        else: # runs only when the try block succeeded
            singlesoup = BeautifulSoup(htmlfile, 'html.parser') # parsed html of the article page
            full_text = ""
            for n in singlesoup("p"): # every p tag, which usually holds the article body
                full_text += n.get_text(strip=True) # paragraphs are joined without a separator
            dictionary[k] = full_text
    return dictionary

text_dict_of_1000 = getsourcetext(sourceurl[:1000]) # it takes 1 hour to work the first 1000 urls 

In [None]:
# Finalized dataframe and save as csv file
full_text_df = pd.DataFrame(list(text_dict_of_1000.items()),
                   columns=['Source', 'Text'])
merged_1000_df = pd.merge(gunviolencedataset, full_text_df, on = "Source")
merged_1000_df.to_csv(path_or_buf = "first_1000_dataset_with_text") 

Run the code below to get the final version of the data.

In [None]:
# Read in the created csv file above
imported_df = pd.read_csv("first_1000_dataset_with_text")
imported_df['Date'] = imported_df['Date'].apply(lambda x: datetime.datetime.strptime(x,"%B %d, %Y"))
imported_df = imported_df.sort_values(by = 'Date')
no_na = imported_df.dropna()

In [None]:
# visual inspection shows that texts shorter than 300 characters are not article bodies
from nltk.tokenize import RegexpTokenizer

n = 0   # rows inspected
n1 = 0  # rows with fewer than 300 characters
m = []  # index labels of short rows that contain a boilerplate token
eliminationlist = ["©", "Terms of Use", "theTerms", "Terms of Service", "Privacy Policy", "JavaScript", "Policy•CitizensNet"]
tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # split on whitespace
# note: tokens are single whitespace-separated words, so the multi-word
# entries above (e.g. "Terms of Use") can never equal a token
for i, string in no_na["Text"].items():
    if len(string) < 300:
        for word in tokenizer.tokenize(string):
            if word in eliminationlist:
                m.append(i)
                break
        n1 += 1
    n += 1
print(m, n, n1, sep="\n")
cleaned_imported_df = no_na.drop(labels=m, axis=0)

In [None]:
data = no_na.loc[no_na["Text"].str.len() > 300]
data = data.iloc[:, 2:11]

print(data.shape)
data.head()

After scraping the sources of the first 1000 incidents, sorted in chronological order, we have a total of 523 rows. The rest were lost for reasons related to changes on the source websites:
- the URL no longer works and returns a 404 status code
- the article no longer exists
- the source is video content rather than text
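
The failure categories listed above could be tallied during scraping instead of being skipped silently. The following is a minimal sketch, assuming a hypothetical helper `classify_failure` and label names that are not part of the original notebook:

```python
import socket
import urllib.error
from collections import Counter

def classify_failure(exc):
    """Map an exception raised while fetching a URL to a coarse label.

    Hypothetical helper for illustration; the label names are assumptions.
    """
    # HTTPError is a subclass of URLError, so it has to be checked first
    if isinstance(exc, urllib.error.HTTPError):
        return "http_" + str(exc.code)  # e.g. "http_404"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, urllib.error.URLError):
        return "unreachable"  # DNS failure, refused connection, and similar
    if isinstance(exc, UnicodeDecodeError):
        return "decode_error"
    return "other"

failure_counts = Counter()
# Inside the scraping loop one might write:
#     except Exception as e:
#         failure_counts[classify_failure(e)] += 1
#         continue
```

Counting the labels this way would show how many of the lost rows come from each cause, which the notebook currently does not record.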

## Exploratory Analysis:
After obtaining and cleaning the data, we looked at summary statistics and visualizations of the data gathered.
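
The per-date totals that the loop below accumulates can also be obtained with a pandas groupby. This is a minimal sketch on a made-up toy frame, assuming the same `Date`, `Killed`, and `Injured` column names as `data`; the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-03"]),
    "Killed": [1, 0, 2],
    "Injured": [3, 4, 2],
})

# Sum casualties over all incidents that share a date.
daily = toy.groupby("Date")[["Killed", "Injured"]].sum()
print(daily)
```

Unlike the manual loop, the groupby does not depend on the rows already being sorted by date.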

In [None]:
dates = []
num_killed = []
num_injured = []

for row in range(len(data)):
    x = data.iloc[row,1]
    if x in dates:
        num_killed[len(num_killed)-1]+= data.iloc[row,5]
        num_injured[len(num_injured)-1]+= data.iloc[row,6]
    else:
        dates.append(x)
        num_killed.append(data.iloc[row,5])
        num_injured.append(data.iloc[row,6])
        

In [None]:
plt.plot(dates, num_injured, label = "People Injured")
plt.plot(dates, num_killed, label = "People Killed")

# naming the x axis
plt.xlabel('Dates')
# naming the y axis
plt.ylabel('Number of People')
# giving a title to my graph
plt.title('Number of People Killed or Injured by Guns')
  
# show a legend on the plot
plt.legend()
  
# function to show the plot
plt.show()

In [None]:
d = {'statistic' : ['mean','median','mode','std','min','max'], 
     'Number Killed' : [np.mean(num_killed),np.median(num_killed),statistics.mode(num_killed),np.std(num_killed),min(num_killed),max(num_killed)],
    'Number Injured' : [np.mean(num_injured),np.median(num_injured),statistics.mode(num_injured),np.std(num_injured),min(num_injured),max(num_injured)]
    }

stats_of_the_data = pd.DataFrame(data = d)
print(stats_of_the_data)

In [None]:
words = data["Text"].str.split()
words = words.map(lambda x: len(x))

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
axs[0].plot(data["Date"], words)
axs[0].set_title("Length of Article by Date")
axs[1].hist(words, bins = 50)
axs[1].set_title("Length of Article Histogram")
plt.show()

By visual inspection, the length of the articles appears to follow the trend in the number of injured or killed victims. This has not yet been tested statistically.
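
One way to check this impression is a correlation coefficient between article length and casualties. The sketch below uses a made-up toy frame with the notebook's `Text`, `Killed`, and `Injured` column names; combining killed and injured into a single `casualties` series is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "Text": ["one two", "one two three four", "one two three four five six"],
    "Killed": [0, 1, 2],
    "Injured": [4, 5, 7],
})

word_count = toy["Text"].str.split().str.len()
casualties = toy["Killed"] + toy["Injured"]

# Pearson correlation measures linear association.
r = word_count.corr(casualties)
# Correlating the ranks gives Spearman's rho, which only assumes monotonicity.
rho = word_count.rank().corr(casualties.rank())
print(r, rho)
```

A coefficient near 0 on the real data would suggest that the visual similarity between the two plots is coincidental.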

In [None]:
from textblob import TextBlob # simple library; suitable for exploratory analysis
def polarity(text):
    return TextBlob(text).sentiment.polarity 

polarity_score = data['Text'].apply(lambda x : polarity(x))

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
axs[0].plot(data["Date"], polarity_score)
axs[0].set_title("Polarity Score by Date")
axs[1].hist(polarity_score)
axs[1].set_title("Polarity Score Histogram")
plt.show()

The distribution of polarity scores from TextBlob's sentiment.polarity is centered near 0 with little spread, so the articles read as mostly neutral.

## Analysis Method Outlines: N-grams and Sentiment Analysis Classification

**NLP Analysis:** We wanted to use Natural Language Processing to inspect the relationships among words in the articles we found, in the hope of relating the frequency of certain words, and the context in which these frequent words appear, to the increase of gun violence in America. To start, the extracted article text must be tokenized (split into words), the words and punctuation that are too common to be useful must be removed, and the frequency of the remaining words must be assessed. N-grams are then built from these words to assess the context in which the frequent words occur.

What remains to be done is to connect these results to a collective meaning about the articles, and to how that relates to gun violence.
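
The steps described above (tokenize, drop stopwords, count, build n-grams) can be sketched with the standard library alone. This is an illustration only: the function name `top_ngrams`, the tiny stopword set, and the sample sentence are assumptions, and the notebook itself uses NLTK's tokenizer and English stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "in", "was", "and", "of", "on"}

def top_ngrams(text, n=1, k=3):
    """Return the k most common n-grams of `text` with their relative frequencies."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    # zip over n shifted copies of the token list to form consecutive n-grams
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    counts = Counter(ngrams)
    total = sum(counts.values())
    return [(" ".join(gram), count / total) for gram, count in counts.most_common(k)]

sample = "Police said the man was shot. Police said the shooting was on Main Street."
print(top_ngrams(sample, n=1))
print(top_ngrams(sample, n=2))
```

Dividing each count by the total number of n-grams in the article makes frequencies comparable between short and long articles.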

In [15]:
import nltk
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
# make list of stopwords, numbers, and punctuation
stopwords = nltk.corpus.stopwords.words('english')
capstopwords = [w.title() for w in stopwords]
numbers = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "zero", ",", "'"]
stopwords.extend(capstopwords)
stopwords.extend(numbers)

In [16]:
# keywordstuple_list = []
# for article in data["Text"]:
#     tokenizer = RegexpTokenizer(r'\w+')
#     articletext = tokenizer.tokenize(article)
#     articletext = [word.lower() for word in articletext if word not in stopwords]
#     frequency_dist = FreqDist(articletext)
#     keywordstuple = frequency_dist.most_common(10)
#     keywordstuple_list.append(list(keywordstuple))

# keywordstuple_list 

In [34]:
def uncouplevector(vector):
    uncoupledvector = []
    for ftuple in vector:
        t1 = ftuple[0]
        t2 = int(ftuple[1])
        uncoupledvector.append(t1)
        uncoupledvector.append(t2)
    return uncoupledvector

def averageuncv(vector, N):
    # divide each count (the odd positions) by N to get a relative frequency
    n = 1
    for element in vector:
        if isinstance(element, int):
            vector[n] = element/N
            n += 2
    return vector

def addtokeyworddf(vector, row):
    # write the values straight into keywordDF; editing a copy of the row
    # is what triggers pandas' SettingWithCopyWarning
    for n, element in enumerate(vector):
        keywordDF.iloc[row, n] = element
    

def bigramofDF(dataframe, column):
    # returns one tuple of bigrams per article, in row order, and fills
    # bigramDF with each article's ten most frequent bigrams
    bigramtuples = []
    textdata = dataframe[str(column)]
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')  # keep runs of letters as tokens
    for n, row in enumerate(textdata):
        tokenizedarticle = tokenizer.tokenize(row)
        tokenizedarticle = [word.lower() for word in tokenizedarticle if word not in stopwords]
        bigrams = list(nltk.bigrams(tokenizedarticle))
        bigramfreq = FreqDist(bigrams)
        top10bigrams = bigramfreq.most_common(10)
        bigramN = bigramfreq.N()
        # adds to a 542 x 20 table of the most frequent bigrams
        addtobigramdf(top10bigrams, n, bigramN)
        bigramtuples.append(tuple(bigrams))
        #score = bigrams.score_ngrams(bgm.likelihood_ratio)
        #likelihoodscores.append(score)
    return bigramtuples

def addtobigramdf(vector, row, N):
    # columns alternate between a bigram (stored as "word1 word2") and its relative frequency
    n = 0
    for element in vector:
        bigramDF.iloc[row, n] = " ".join(element[0])
        bigramDF.iloc[row, n + 1] = element[1]/N
        n += 2
    

In [18]:
l = cleaned_imported_df.shape
# object dtype so each row can hold both words (str) and frequencies (float)
keywordDF = pd.DataFrame(np.zeros((l[0], 20)), dtype=object)
n = 0
tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # split on whitespace
for index, row in cleaned_imported_df.iterrows():
    tokenizedarticle = tokenizer.tokenize(row["Text"])
    articletext = [word.lower() for word in tokenizedarticle if word not in stopwords]
    frequencydist = FreqDist(articletext)
    wordsN = frequencydist.N()
    keywords = frequencydist.most_common(10)
    keywordvector = uncouplevector(keywords)
    averagedkeywords = averageuncv(keywordvector, wordsN)
    addtokeyworddf(averagedkeywords, n)
    n += 1
print(keywordDF)


             0         1          2         3            4         5   \
0          erie  0.036697     police  0.036697          man  0.027523   
1     shootings  0.034483       shot  0.034483       police  0.025862   
2      shooting  0.034653       club  0.024752       police  0.014851   
3    wednesday,  0.025000       cook  0.025000    mcmillian  0.025000   
4        people  0.015464        los  0.015464      angeles  0.015464   
..          ...       ...        ...       ...          ...       ...   
537   2021march  0.022599     police  0.016949     cockrell  0.016949   
538     victims  0.030303       said  0.030303     shooting  0.018182   
539      county  0.014493  sheriff’s  0.014493       office  0.014493   
540         man  0.020101     police  0.020101     suffered  0.015075   
541        said  0.022044     family  0.020040  gramiccioni  0.016032   

            6         7            8         9            10        11  \
0     shooting  0.027523        hamot  0.027523  

In [19]:
firstmostcommonwords = keywordDF[0]
firstmostcommonwords.value_counts()
secondmostcommonwords = keywordDF[2]
secondmostcommonwords.value_counts()


police        62
shooting      38
said          27
shot          20
people        13
              ..
sheriff's      1
medical        1
paramedics     1
sheriff’s      1
center         1
Name: 2, Length: 268, dtype: int64

In [20]:
#create a graph/track the most common words (5) in article over time(by month?), what do we see(?)
#create a graph/ look at/for time lag. Does a word's increase over the course of a few days/weeks (n=1,2,3,4,5,6,7) 
# do we need more data to fill in data enough for the time lag analysis? 


In [35]:
from nltk.classify.util import apply_features, accuracy as eval_accuracy
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import (
    BigramAssocMeasures,
    precision as eval_precision,
    recall as eval_recall,
    f_measure as eval_f_measure,
)
bgm    = nltk.collocations.BigramAssocMeasures()
# object dtype so each row can hold both bigrams and their frequencies
bigramDF = pd.DataFrame(np.zeros((l[0], 20)), dtype=object)
from nltk.probability import FreqDist
#ngram analysis of text
likelihoodscores = []
tuplelist = []

    #create a dictionary, key is article#, value is all bigrams

bigramdict = bigramofDF(cleaned_imported_df, "Text")
bigramdict[0]


In [None]:
#add death/injury data to article data
cleaned_imported_df
index1 = cleaned_imported_df.index
labels = ['ID', 'Date', 'State', 'City/County', 'Address', 'Killed', 'Injured']
interimdf = cleaned_imported_df[labels]
#unigram frequencies
keywordlabels = ["1st MCW", "1st MCW Freq", "2nd MCW", "2nd MCW Freq", "3rd MCW", "3rd MCW Freq", "4th MCW", "4th MCW Freq", "5th MCW", "5th MCW Freq", "6th MCW", "6th MCW Freq", "7th MCW", "7th MCW Freq", "8th MCW", "8th MCW Freq", "9th MCW", "9th MCW Freq", "10th MCW", "10th MCW Freq"]
keywordDF.columns = keywordlabels
keywordDF.index = index1
unigramfreq = pd.concat([interimdf, keywordDF], axis=1)
unigramfreq.to_csv("UnigramFrequencies")
#bigram frequencies
keywordlabels = ["1st MFB", "1st MFB Freq", "2nd MFB", "2nd MFB Freq", "3rd MFB", "3rd MFB Freq", "4th MFB", "4th MFB Freq", "5th MFB", "5th MFB Freq", "6th MFB", "6th MFB Freq", "7th MFB", "7th MFB Freq", "8th MFB", "8th MFB Freq", "9th MFB", "9th MFB Freq", "10th MFB", "10th MFB Freq"]
bigramDF.columns = keywordlabels
bigramDF.index = index1
bigramfreq = pd.concat([bigramDF, interimdf], axis=1) 
bigramfreq.to_csv("BigramFrequencies")
#need to organize by date
grouplabel = []

In [None]:
dategroupedunigram = unigramfreq.groupby("Date")
namelist=[]
for name,group in dategroupedunigram:
    namelist.append(name) # one entry per distinct date
# assumption: the aggregation intended here is the total killed and injured per date
casualtiesbydate = dategroupedunigram[["Killed", "Injured"]].sum()
timestamp = " 00:00:00" #add this to index to access data, date is format 2000-01-31
#groupedunigram = np.zeros([374,27], dtype=int)
#unigrambydate = pd.DataFrame(unigrambydate)
nseries = pd.DataFrame(np.zeros([374,1]))
print(casualtiesbydate)
#unigrambydate = pd.DataFrame(dategroupedunigram)
#unigrambydate = unigrambydate[1][1].unstack()
#i = unigrambydate.index
#unigrambydate['ID', 119]
#groupedunigram[0] = unigrambydate[i]

In [None]:
print(bigramDF["1st MFB"].value_counts())
#print(bigramDF["2nd MFB"].value_counts())
#print(bigramDF[4].value_counts())
#print(bigramDF[6].value_counts())
#print(bigramDF[10].value_counts())
#print(bigramDF[12].value_counts())
#print(bigramDF[14].value_counts())
#print(bigramDF[16].value_counts())
#print(bigramDF[18].value_counts())
grouplabel


I have the most frequent words and how often each appears in its text, for unigrams and bigrams. I want to do trigrams too, but they might not be necessary, so I'll do them later.

Does the frequency of certain words go up or down over time? Do rates of death/injury correlate with the frequency changes? I think this is a multiple linear regression.
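
To make the regression idea concrete: "multiple" means more than one predictor, for example a word's frequency together with elapsed time. The sketch below fits such a model by ordinary least squares with NumPy on made-up, noise-free numbers; the variable names and values are assumptions for illustration and are not results from the dataset.

```python
import numpy as np

# Made-up toy values: keyword frequency, days since the first incident, deaths.
freq = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
days = np.array([0.0, 10.0, 20.0, 35.0, 50.0])
killed = 1.0 + 40.0 * freq + 0.02 * days  # noise-free, so the fit is exact

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(freq), freq, days])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, killed, rcond=None)
intercept, b_freq, b_days = coef
print(intercept, b_freq, b_days)
```

On real data the coefficients would come with uncertainty, which is why a statistics package that reports standard errors and p-values is the more useful tool for the actual analysis.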

In [40]:
import statsmodels.formula.api as sm
singlewordfreq = pd.read_csv("UnigramFrequencies")
# column names containing spaces must be wrapped in patsy's Q(); a bare quoted
# string is read as a one-element constant, which raises a PatsyError
unigramols = sm.ols(formula="Killed ~ Q('1st MCW Freq')", data=singlewordfreq).fit()
unigramols.summary()

**Sentiment Analysis:** Our second kind of analysis for the articles is sentiment analysis classification. We created a classification model that trains on a library of tweets rated on a negative-to-positive scale from 0 through 4. Each tweet is vectorized into features and paired with its rating.

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
try:
    from gensim.models import Word2Vec
except Exception as e:
    pass
from collections import Counter

import nltk
#nltk.download('all')

Importing and trimming twitter sentiment dataset

Insert header, remove numbers, usernames, and NO_QUERY

Move first row that became header down to data, add headers

In [None]:
df = pd.read_csv('training.1600000.processed.noemoticon.csv',encoding = "ISO-8859-1",names=["score","id","datetime","NO_QUERY","usernames","tweet"])

df = df.drop(['id','NO_QUERY','usernames'],axis=1)

df.head()

Checking count values, dataset listed from 0 = negative to 4 = positive. 

In [None]:
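# the file appears to be ordered by score, so the first 20000 rows would all share one label;
# draw a random sample instead (random_state fixed for reproducibility)
df = df.sample(n=20000, random_state=0).reset_index(drop=True)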
print(df.shape)
print(df['score'].value_counts())

Clean up links, @users, hashtags

In [None]:
stopwords = nltk.corpus.stopwords.words("english")

def cleanup(text):
    clean = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",str(text)).split()
    tokens = []
    for token in clean:
        if token not in stopwords:
            tokens.append(token)
    return tokens

In [None]:
df['tweet'] = df['tweet'].apply(lambda x: cleanup(x))

df['score'] = df['score'].replace([0,4],['neg','pos']) 

df.head()

In [None]:
def to_tuple(x):
    subset = x[['tweet','score']]
    tuples = [tuple(i) for i in subset.to_numpy()]
    return tuples

In [None]:
documents = to_tuple(df)
documents[0] # tuple of tokens and score

In [None]:
# def word_master(x):
#     master_list = []
#     for i in range(len(x)):
#         master_list += x['tweet'][i] 
#     return master_list

words = remove_nesting(df['tweet'])

word_features = nltk.FreqDist(w.lower() for w in words).most_common(2000)
word_features =  [word_tuple[0] for word_tuple in word_features]
word_features[0:5] 

In [None]:
def document_features(document):    
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains('+ word +')'] = (word in document_words) 
    return features

featuresets = [(document_features(d), c) for (d,c) in documents] 
featuresets[1]

With the words computed as feature vectors, the next cell establishes the training and test sets and then runs the classification model on the test set.

What remains to be done here is to test the model on the article data, which also needs to be turned into feature vectors.
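
If the labeled data happens to be ordered by class, slicing off the first rows gives a skewed test set, and an accuracy figure is hard to read without a baseline. The sketch below shuffles before splitting and computes the accuracy of always predicting the majority label. The helper names `shuffled_split` and `majority_baseline` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import random
from collections import Counter

def shuffled_split(labeled, test_size, seed=0):
    """Shuffle (features, label) pairs, then split into train and test lists."""
    items = list(labeled)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    return items[test_size:], items[:test_size]

def majority_baseline(train, test):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

# Toy labeled data ordered by class, as a sorted dataset would be.
toy = [({"id": i}, "neg") for i in range(60)] + [({"id": i}, "pos") for i in range(60, 100)]
train, test = shuffled_split(toy, test_size=20)
print(len(train), len(test), majority_baseline(train, test))
```

A classifier is only informative if its test accuracy is clearly above this baseline.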

In [None]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(10)

In [None]:
text_labels = []
tokenizer = RegexpTokenizer(r'\w+')
for article in data["Text"]:
    articletext = tokenizer.tokenize(article)
    articletext = [word.lower() for word in articletext if word not in stopwords]
    # classify the token list, not the raw string (a string would be split into characters)
    text_labels.append(classifier.classify(document_features(articletext)))

In [None]:
plt.hist(text_labels)

# Ethical Considerations

The gun violence data is collected manually from reliable news and police reports in an organized manner and is intended for public use. Since we are not republishing any material or using any metadata, there is also no conflict with the Rights and Limit of Use stated by the aforementioned news outlets.

There could be some unintended bias in the data collection: all incidents with dysfunctional links were omitted, because finding substitute URLs for them would have been very time-consuming.

We are working with statistics surrounding violent death or injury, it must be remembered that 