# Introduction

As a resource for social data, Twitter’s platform has been used to measure quality of life through sentiment analysis. This capstone project explores another methodological technique: using specific keyword terms to determine dominant topics, word patterns, and sentiment leanings in a geographical area. Focusing on New York City and Los Angeles for comparative analysis, it uses the keyword term “why” to build a Python analysis around topic modeling and sentiment analysis. With this approach, the analysis reveals social and cultural differences, the overall sentiment of tweets, and areas of interest to tweeters.

### Contents
1. Install Libraries
2. Import Python Libraries
3. Data Setup *(import, query, convert JSON to DataFrame, and clean Twitter data)*
4. Categorizing Sentiment on Tweets
5. Exploratory Analysis
6. Topic Modeling
7. Topic Bubble Map
8. Accuracy of Sentiment Analysis

# Install Libraries

If any libraries turn out to be missing when you import them in the next step, use this section to install them. The NLTK data packages are fetched with nltk.download(), which can only be called AFTER the NLTK library has been imported; this notebook relies on resources such as stopwords, wordnet, punkt, and vader_lexicon.
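One way to find out which libraries are missing before running the imports is sketched below. The helper `missing_packages` and the `REQUIRED` mapping are illustrative, not part of the original notebook, and the pip package names (for example `searchtweets-v2` for the `searchtweets` module) are assumptions that should be checked against each project's installation instructions.

```python
from importlib.util import find_spec

# Import name -> pip package name. The pip names are assumptions to verify.
REQUIRED = {
    'searchtweets': 'searchtweets-v2',
    'pandas': 'pandas',
    'numpy': 'numpy',
    'contractions': 'contractions',
    'matplotlib': 'matplotlib',
    'wordcloud': 'wordcloud',
    'PIL': 'Pillow',
    'plotly': 'plotly',
    'geopandas': 'geopandas',
    'nltk': 'nltk',
    'textblob': 'textblob',
    'sklearn': 'scikit-learn',
}

def missing_packages(required):
    """Return the pip names of modules that cannot be found in this environment."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]

missing = missing_packages(REQUIRED)
if missing:
    # Print a ready-to-copy install command for whatever is missing.
    print('pip install ' + ' '.join(missing))
else:
    print('All libraries are available.')
```

The check only looks a module up without importing it, so it is quick to run.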

# Python Libraries

In [None]:
import searchtweets as twitter

import pandas as pd
import numpy as np

import re
import contractions
import string
import textwrap

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import plotly.graph_objs as go
import plotly.express as px
import geopandas as gpd

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

from textblob import TextBlob

import sklearn
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import normalize

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

import warnings
from warnings import simplefilter
warnings.filterwarnings('ignore')
simplefilter(action='ignore', category=FutureWarning)

# Data Setup

## Import Twitter Data

The file *twitter_keys.yaml* loaded in this code needs to be edited to hold your own API keys and bearer token. Twitter does not allow the sharing of tokens.

In [None]:
#file contains API and Bearer tokens
search_args = twitter.load_credentials('twitter_keys.yaml',
                                     yaml_key='search_tweets_v2',
                                     env_overwrite=False)

## Query Twitter Data

This step sets up the search terms for the query. Twitter offers [documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) on building a query.
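The query strings used below follow one pattern: the keyword, a group of `place:` operators joined with `OR`, and filters that exclude retweets and promoted-only tweets and restrict the language. A minimal sketch of that pattern follows; `build_query` is a hypothetical helper that does not appear in the notebook.

```python
def build_query(keyword, place_ids, lang='en'):
    """Compose a search query string in the same shape as the notebook's queries."""
    # Each place ID becomes a place: operator; OR matches any of them.
    places = ' OR '.join('place:{}'.format(pid) for pid in place_ids)
    parts = [
        keyword,
        '({})'.format(places),
        '-is:retweet',    # exclude retweets
        '-is:nullcast',   # exclude promoted-only tweets
        'lang:{}'.format(lang),
    ]
    return ' '.join(parts)

# The five place IDs from the New York City query and the one from the Los Angeles query.
nyc_places = ['01a9a39529b27f36', '011add077f4d2da3', '00c39537733fa112',
              '002e24c6736f069d', '00c55f041e27dc51']
la_places = ['3b77caf94bfc81fe']

print(build_query('why', nyc_places))
print(build_query('why', la_places))
```

Building the string this way keeps the two city queries identical apart from their place IDs, which matters for a like-for-like comparison.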

In [None]:
#set up parameters for query search
query_whynyc = twitter.gen_request_parameters(
    query = 'why (place:01a9a39529b27f36 OR place:011add077f4d2da3 OR place:00c39537733fa112 OR place:002e24c6736f069d OR place:00c55f041e27dc51) -is:retweet -is:nullcast lang:en',
    results_per_call = 500,
    start_time = '2021-01-01', 
    end_time = '2022-01-01',
    tweet_fields = 'id,created_at,text,geo',
    granularity=''
)

#140k to get all of 2021
whynyc_tweets = twitter.collect_results(
    query_whynyc,
    max_tweets=140000, 
    result_stream_args=search_args
)

In [None]:
query_whyla = twitter.gen_request_parameters(
    query = 'why (place:3b77caf94bfc81fe) -is:retweet -is:nullcast lang:en',
    results_per_call = 500,
    start_time = '2021-01-01', 
    end_time = '2022-01-01',
    tweet_fields = 'id,created_at,text,geo',
    granularity=''
)

#90k
whyla_tweets = twitter.collect_results(
    query_whyla, 
    max_tweets=90000, 
    result_stream_args=search_args 
)

## Convert JSON to DataFrame

In [None]:
whynyc_df = pd.json_normalize(whynyc_tweets, record_path=['data'])
whyla_df = pd.json_normalize(whyla_tweets, record_path=['data'])

## Data Cleaning

In [None]:
def find_links(tweet):
    #extracts the links in a tweet
    return re.findall(r'(http\S+|bit\.ly/\S+)', tweet)

#def find_retweeted(tweet):
    #finds and extracts retweeted twitter handles
    #return re.findall(r'(?<=RT\s)(@[A-Za-z0-9]+[A-Za-z0-9_-]+)', tweet)

def find_mentioned(tweet):
    #finds and extracts the twitter handles of people mentioned
    return re.findall(r'(?<!RT\s)(@[A-Za-z0-9]+[A-Za-z0-9_-]+)', tweet)

def find_hashtags(tweet):
    #extracts the hashtags in a tweet
    return re.findall(r'(#[A-Za-z0-9]+[A-Za-z0-9_-]+)', tweet)

In [None]:
# make new columns for links, retweeted usernames, mentioned usernames and hashtags
whynyc_df['links'] = whynyc_df.text.apply(find_links)
#whynyc_df['retweeted'] = whynyc_df.text.apply(find_retweeted)
whynyc_df['mentioned'] = whynyc_df.text.apply(find_mentioned)
whynyc_df['hashtags'] = whynyc_df.text.apply(find_hashtags)

whyla_df['links'] = whyla_df.text.apply(find_links)
#whyla_df['retweeted'] = whyla_df.text.apply(find_retweeted)
whyla_df['mentioned'] = whyla_df.text.apply(find_mentioned)
whyla_df['hashtags'] = whyla_df.text.apply(find_hashtags)

In [None]:
#to clean up the ['text'] column

#named stop_words so that the imported nltk stopwords module is not shadowed
stop_words = set(nltk.corpus.stopwords.words('english'))
lemmatizer = WordNetLemmatizer() #reduces inflected words to a common base form
punctuation = string.punctuation #!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
symbol = '—…«»“”‘’' #for symbols not captured in punctuation

def clean_df(tweet):
    #remove parts of a tweet
    tweet = re.sub(r'http\S+', '', tweet) #removes links
    tweet = re.sub(r'bit\.ly/\S+', '', tweet) #removes bitly links
    #tweet = re.sub(r'(RT\s@[A-Za-z0-9]+[A-Za-z0-9_-]+)', '', tweet) #removes retweeted usernames
    tweet = re.sub(r'(@[A-Za-z0-9]+[A-Za-z0-9_-]+)', '', tweet) #removes mentioned usernames
    #hashtags are kept for this analysis; uncomment to remove them
    #tweet = re.sub(r'(#[A-Za-z0-9]+[A-Za-z0-9_-]+)', '', tweet)

    #clean up HTML entities and line breaks left in the raw text
    tweet = re.sub('&amp;', '&', tweet)
    tweet = re.sub(r'\n', ' ', tweet) #a space keeps words on adjacent lines apart

    #lower-case characters
    tweet = tweet.lower()

    #expand contractions (e.g. "don't" becomes "do not")
    tweet = contractions.fix(tweet)

    #remove numbers
    tweet = re.sub(r'[0-9]+', '', tweet)

    #remove punctuation (this also strips the '#' from hashtags, leaving the tag text)
    tweet = re.sub('[' + re.escape(punctuation) + ']+', ' ', tweet)

    #remove symbols not captured in punctuation
    tweet = re.sub('[' + symbol + ']+', ' ', tweet)

    #collapse repeated whitespace and trim both ends
    tweet = re.sub(r'\s+', ' ', tweet).strip()

    #tokenize words and remove stopwords
    tweet_token_list = [word for word in tweet.split(' ')
                        if word not in stop_words]

    #apply word lemmatization
    tweet_token_list = [lemmatizer.lemmatize(word) for word in tweet_token_list]

    return ' '.join(tweet_token_list)

#create a new column for the cleaned text column.
whynyc_df['corpora'] = whynyc_df.text.apply(clean_df)
whyla_df['corpora'] = whyla_df.text.apply(clean_df)

In [None]:
#pull list of columns
list(whynyc_df)

In [None]:
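#reorder columns in dataframe
#.copy() makes independent dataframes so that columns added later do not trigger SettingWithCopyWarning
whynyc = whynyc_df[['id', 'created_at', 'text', 'corpora', 'geo.place_id']].copy()
whyla = whyla_df[['id', 'created_at', 'text', 'corpora', 'geo.place_id']].copy()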

#created for a one-time analysis
#adding this to whynyc and whyla dataframes may cause kernel to crash due to size
whynyc_pot = whynyc_df[['mentioned', 'hashtags', 'links']]
whyla_pot = whyla_df[['mentioned', 'hashtags', 'links']]

# Categorizing Sentiment on Tweets

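This section uses NLTK's VADER analyzer and TextBlob to calculate the polarity/sentiment (NLTK & TextBlob) and subjectivity (TextBlob only) scores of tweets. The two polarity scores are averaged into a single score, which is then labeled positive (above 0), neutral (equal to 0), or negative (below 0).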
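The averaging and labeling step can be illustrated without either library. The sketch below uses made-up polarity scores in place of real NLTK and TextBlob output, and `combined_sentiment` is a hypothetical name.

```python
def sentiment_category(score):
    """Label a polarity score using the same thresholds as the notebook."""
    if score > 0:
        return 'positive'
    elif score == 0:
        return 'neutral'
    return 'negative'

def combined_sentiment(nltk_compound, textblob_polarity):
    """Average two polarity scores (each in [-1, 1]) and label the result."""
    score = (nltk_compound + textblob_polarity) / 2
    return score, sentiment_category(score)

# Made-up score pairs: (NLTK compound score, TextBlob polarity).
for pair in [(0.6, 0.2), (-0.5, 0.1), (0.5, -0.5), (0.0, 0.0)]:
    print(pair, combined_sentiment(*pair))
```

Two things stand out in this scheme. When the two analyzers disagree by equal amounts, as in the `(0.5, -0.5)` pair, the average is exactly 0 and the tweet is labeled neutral. And because only an exact 0 counts as neutral, any small nonzero average is labeled positive or negative.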

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
def add_sentiment(why_df):
    #compound polarity score from NLTK's VADER analyzer
    nltk_polarity = why_df['text'].apply(lambda text: sid.polarity_scores(str(text))['compound'])

    #subjectivity score from TextBlob (VADER does not provide one)
    why_df['subjectivity'] = why_df['text'].apply(lambda text: TextBlob(text).sentiment.subjectivity)

    #polarity score from TextBlob
    textblob_polarity = why_df['text'].apply(lambda text: TextBlob(text).sentiment.polarity)

    #average of the NLTK and TextBlob polarity scores, aligned on the dataframe index
    why_df['score'] = (nltk_polarity + textblob_polarity) / 2

    #categorize sentiment scores
    def sentiment_category(sentiment):
        if sentiment > 0:
            return 'positive'
        elif sentiment == 0:
            return 'neutral'
        else:
            return 'negative'
    why_df['sentiment'] = why_df['score'].apply(sentiment_category)

    return why_df

whynyc = add_sentiment(whynyc)
whyla = add_sentiment(whyla)

# Exploratory Analysis

Because links, mentions, and hashtags each appear in roughly a third or less of the tweets queried, this portion only gives a basic idea of how much each of the following parts of a tweet contributes. In a future project, a network analysis will come into play for this part.
1. Links (either to an image or a website)
2. Mentioned
3. Hashtags



<table>
  <thead>
    <tr>
      <th> </th>
      <th>New York City</th>
      <th>Los Angeles</th>
      <th>% of tweets (NYC/LA)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Links account for</td>
      <td>41,566 tweets</td>
      <td>24,497 tweets</td>
      <td>(30%/27%)</td>
    </tr>
    <tr>
      <td>Mentioned tweeters account for</td>
      <td>46,067 tweets</td>
      <td>28,535 tweets</td>
      <td>(33%/32%)</td>
    </tr>
    <tr>
      <td>Hashtags account for</td>
      <td>9,180 tweets</td>
      <td>5,537 tweets</td>
      <td>(7%/6%)</td>
    </tr>
  </tbody>
</table>
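Counts and percentages like those in the table can be obtained by checking, for each tweet, whether a pattern matches at least once. The sketch below shows the idea on four made-up tweets, with patterns modeled on the ones in the Data Cleaning section; `share_with_match` is a hypothetical helper, and this is not necessarily how the table's figures were produced.

```python
import re

# Patterns modeled on the notebook's find_links, find_mentioned and find_hashtags.
LINK_RE = re.compile(r'(http\S+|bit\.ly/\S+)')
MENTION_RE = re.compile(r'(?<!RT\s)(@[A-Za-z0-9]+[A-Za-z0-9_-]+)')
HASHTAG_RE = re.compile(r'(#[A-Za-z0-9]+[A-Za-z0-9_-]+)')

def share_with_match(tweets, pattern):
    """Return (count, percentage) of tweets containing at least one match."""
    count = sum(1 for tweet in tweets if pattern.search(tweet))
    pct = 100 * count / len(tweets) if tweets else 0.0
    return count, pct

# Made-up tweets for illustration only.
sample = [
    'why is the train late again https://t.co/abc123',
    '@mta why though',
    'why not #nyc',
    'why',
]

for name, pattern in [('links', LINK_RE), ('mentions', MENTION_RE), ('hashtags', HASHTAG_RE)]:
    count, pct = share_with_match(sample, pattern)
    print('{}: {} tweets ({:.0f}%)'.format(name, count, pct))
```

Counting tweets rather than individual matches means a tweet with three hashtags contributes once, which matches the "tweets" unit used in the table.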

In [None]:
len(whynyc)

In [None]:
len(whyla)

In [None]:
whynyc_pot['links'].value_counts()

In [None]:
whynyc_pot['mentioned'].value_counts()

In [None]:
whynyc_pot['hashtags'].value_counts()

In [None]:
whyla_pot['links'].value_counts()

In [None]:
whyla_pot['mentioned'].value_counts()

In [None]:
whyla_pot['hashtags'].value_counts()

## Text Analysis

This portion converts the 'created_at' column into datetime format with the to_datetime() function in pandas. The length of each tweet is then measured and a basic time series analysis is performed.
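To make the derived columns concrete, the sketch below computes the same three values (day, month, and length in words) for a single tweet using only the standard library. The timestamp format `2021-01-06T18:30:00.000Z` is an assumption about what the API returns, the tweet text is made up, and `tweet_features` is a hypothetical helper.

```python
from datetime import datetime, timezone

def tweet_features(created_at, text):
    """Derive day, month and word count for one tweet."""
    # Assumed timestamp format: ISO 8601 with milliseconds and a trailing Z (UTC).
    dt = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone.utc)
    return {
        'day': dt.date(),            # calendar date, used to group tweets per day
        'month': dt.month,           # 1 through 12
        'length': len(text.split())  # number of whitespace-separated words
    }

features = tweet_features('2021-01-06T18:30:00.000Z', 'why is this happening')
print(features)
```

Length here is a word count, not a character count, which is worth keeping in mind when reading the histograms.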

In [None]:
def format_why_df(why_df):
    #convert created_at column to datetime
    why_df['datetime'] = pd.to_datetime(why_df['created_at'], errors='coerce')

    #create a day column
    why_df['day'] = why_df['datetime'].dt.date

    #create a month column
    why_df['month'] = why_df['datetime'].dt.month
    
    #count the number of words in each tweet
    why_df['length']=why_df['text'].apply(lambda x:len(x.split()))
    return why_df

whynyc = format_why_df(whynyc)
whyla = format_why_df(whyla)

In [None]:
whynyc['length'].describe()

In [None]:
whyla['length'].describe()

In [None]:
#set colors for tweets categorized as positive, neutral, or negative
sentiment_colors = {
    'positive': '#2A9D8F',
    'neutral': '#847979',
    'negative': '#F4A259'}

In [None]:
#creates a graph of length of tweets by sentiment in a histogram
px.histogram(
    whynyc, 
    x='length', 
    color='sentiment', 
    color_discrete_map = sentiment_colors)

In [None]:
px.histogram(
    whyla, 
    x='length', 
    color='sentiment', 
    color_discrete_map = sentiment_colors)

In [None]:
#join the cleaned tweets of each dataframe into one long string
whynyc_list = whynyc['corpora'].values.tolist()
whynyc_list = ' '.join(whynyc_list).lower()

whyla_list = whyla['corpora'].values.tolist() 
whyla_list = ' '.join(whyla_list).lower()

In [None]:
#create a frequency distribution and graph it
fdist_whynyc = FreqDist(word_tokenize(whynyc_list))
plt.figure(figsize=(10, 4))
fdist_whynyc.plot(30, cumulative=False)
plt.show()

In [None]:
fdist_whyla = FreqDist(word_tokenize(whyla_list))
plt.figure(figsize=(10, 4))
fdist_whyla.plot(30,cumulative=False)
plt.show()

## Text Analysis - Word Clouds

Creates a word cloud based on the text ('corpora') data for NYC and LA.

In [None]:
def create_wordcloud(file_name, text):
    #load the image file that defines the shape of the word cloud
    mask = np.array(Image.open(file_name))
    #WordCloud masks out pure white (255) pixels only, so turn black (0) pixels white
    #(assumes a single-channel image, as the original element-wise loop did)
    maskable_image = np.where(mask == 0, 255, mask).astype(np.int32)

    #create word cloud
    wordcloud = WordCloud(
        width = 3000, 
        height = 2000, 
        #random_state=1, 
        background_color='white', 
        colormap='twilight_r', 
        contour_width = 1,
        contour_color = '#111954',
        collocations=True, 
        stopwords = STOPWORDS, 
        mask=maskable_image).generate(text)

    #display the word cloud without axis details
    plt.figure(figsize=(15, 7))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

create_wordcloud('new-york-city.png', whynyc_list)
create_wordcloud('los-angeles.png', whyla_list)

## N-grams

N-grams are contiguous sequences of n neighbouring terms in a document. This section looks at n-grams for n = 1 through 4 (unigrams, bigrams, trigrams, and quadgrams).
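As a small illustration of the definition, the sketch below extracts and counts n-grams from three made-up documents with the standard library. It uses raw counts, whereas the notebook's own cells rank n-grams by TF-IDF weight; `ngram_counts` is a hypothetical helper.

```python
from collections import Counter

def ngram_counts(documents, n):
    """Count every run of n consecutive whitespace-separated tokens."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        # A document shorter than n tokens contributes no n-grams.
        counts.update(' '.join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

# Made-up documents for illustration only.
docs = ['why is it so cold', 'why is it raining', 'why not']

bigrams = ngram_counts(docs, 2)
trigrams = ngram_counts(docs, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

N-grams are built per document, so a sequence never spans the boundary between two tweets.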

In [None]:
def get_ngrams(text, ngram_from=2, ngram_to=2, n=None, max_features=20000):
    #fit a TF-IDF vectorizer restricted to n-grams of the requested sizes
    vectorizer = TfidfVectorizer(ngram_range = (ngram_from, ngram_to), 
                          max_features = max_features, 
                          stop_words='english').fit(text)
    tfidf_matrix = vectorizer.transform(text)
    #sum each n-gram's TF-IDF weight over all tweets (a weighted score, not a raw count)
    sum_words = tfidf_matrix.sum(axis = 0) 
    words_freq = [(word, sum_words[0, i]) for word, i in vectorizer.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
   
    #return the n highest-scoring n-grams (all of them when n is None)
    return words_freq[:n]

In [None]:
def ngrams_table(why_df):
    unigrams = pd.DataFrame(get_ngrams(why_df['corpora'], ngram_from=1, ngram_to=1, n=15))
    bigrams = pd.DataFrame(get_ngrams(why_df['corpora'], ngram_from=2, ngram_to=2, n=15))
    trigrams = pd.DataFrame(get_ngrams(why_df['corpora'], ngram_from=3, ngram_to=3, n=15))
    quadgrams = pd.DataFrame(get_ngrams(why_df['corpora'], ngram_from=4, ngram_to=4, n=15))
    
    ngrams = pd.concat([unigrams, bigrams, trigrams, quadgrams], axis = 1)
    #each 'frequency' column holds the summed TF-IDF weight of the n-gram to its left
    ngrams.columns = ['unigrams', 'frequency', 'bigrams', 'frequency', 'trigrams', 'frequency', 'quadgrams', 'frequency']
    return ngrams

ngrams_nyc = ngrams_table(whynyc)
ngrams_la = ngrams_table(whyla)

In [None]:
ngrams_nyc

In [None]:
ngrams_la

## Sentiment EDA

This section is an exploratory data analysis of the sentiment on tweets.

In [None]:
whynyc['score'].describe()

In [None]:
whynyc['sentiment'].describe()

In [None]:
fig, ax = plt.subplots()
whynyc['sentiment'].value_counts().plot(ax=ax, kind='bar', xlabel='sentiment', ylabel='number of tweets')
plt.show()

In [None]:
fig, ax = plt.subplots()
whyla['sentiment'].value_counts().plot(ax=ax, kind='bar', xlabel='sentiment', ylabel='number of tweets')
plt.show()

In [None]:
def pol_and_sub_of_tweets(title, why_df):
    # plot the polarity and subjectivity
    fig = px.scatter(why_df, 
                     x='score', 
                     y='subjectivity', 
                     color = 'sentiment',
                     color_discrete_map = sentiment_colors,
                     size='subjectivity',
                     hover_name=why_df.text.apply(lambda txt: '<br>'.join(textwrap.wrap(txt, width=35))))

    #add a vertical line at x=0 to mark neutral tweets
    fig.update_layout(title=title,
                      shapes=[dict(type= 'line',
                                   yref= 'paper', y0= 0, y1= 1, 
                                   xref= 'x', x0= 0, x1= 0)])
    return fig.show()

pol_and_sub_of_tweets('Subjectivity and Polarity Scores of Tweets in NYC', whynyc)
pol_and_sub_of_tweets('Subjectivity and Polarity Scores of Tweets in LA', whyla)

In [None]:
def sentiment_count(why_df):
    sent_count = why_df.groupby('day').sentiment.value_counts()
    sent_count = sent_count.to_frame(name='count')
    sent_count.reset_index(inplace=True)
    return sent_count

whynyc_sent = sentiment_count(whynyc)
whyla_sent = sentiment_count(whyla)

In [None]:
def sentiment_plotly(why_df):
    fig = go.Figure()
    for c in why_df['sentiment'].unique()[:3]:
        dfp = why_df[why_df['sentiment']==c].pivot(
            index='day', 
            columns='sentiment', 
            values='count')
        
        fig.add_traces(
            go.Scatter(
                x=dfp.index, 
                y=dfp[c], 
                mode='lines', 
                name = c))
    return fig.show()

sentiment_plotly(whynyc_sent)
sentiment_plotly(whyla_sent)

The next few word clouds explore the spikes in sentiment on particular days.

In [None]:
nyc_neg_wc_j6 = whynyc[(whynyc['datetime']>='2021-01-05') & (whynyc['datetime']<='2021-01-08') & (whynyc['sentiment']=='negative')]
#join the tweets themselves; str() of a Series would only give its truncated printout
wordcloud = WordCloud(max_font_size=50, max_words=500, background_color='white').generate(' '.join(nyc_neg_wc_j6['corpora']))
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
la_neg_wc_j6 = whyla[(whyla['datetime']>='2021-01-05') & (whyla['datetime']<='2021-01-08') & (whyla['sentiment']=='negative')]
#join the tweets themselves; str() of a Series would only give its truncated printout
wordcloud = WordCloud(max_font_size=50, max_words=500, background_color='white').generate(' '.join(la_neg_wc_j6['corpora']))
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
positive_wc_m29 = whynyc[(whynyc['datetime']>='2021-03-29') & (whynyc['datetime']<='2021-03-30') & (whynyc['sentiment']=='positive')]
#join the tweets themselves; str() of a Series would only give its truncated printout
wordcloud = WordCloud(max_font_size=50, max_words=500, background_color='white').generate(' '.join(positive_wc_m29['corpora']))
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
def tweets_timeline_plotly(why_df):
    fig = px.scatter(why_df,
                 x='datetime',
                 y='length',
                 #range_y=[50, 120],
                 color_discrete_map = sentiment_colors,
                 color='sentiment',
                 size='length',
                 hover_name=why_df.text.apply(lambda txt: '<br>'.join(textwrap.wrap(txt, width=35)))
                )
    return fig.show()

tweets_timeline_plotly(whynyc)
tweets_timeline_plotly(whyla)

# Topic Modeling

Topic modeling is a text-mining technique that uncovers the abstract topics occurring in a body of text. It is a probabilistic model in which each topic is characterized by the words that appear in it more frequently than others. From the scikit-learn library, LatentDirichletAllocation and TfidfVectorizer are used to build this model.
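Once a model is fitted, each topic is a row of weights with one entry per term, and reading a topic means picking its highest-weighted terms. The sketch below shows that ranking step on made-up weights; the vocabulary, the numbers, and the helper `top_words` are illustrative and were not produced by the model.

```python
def top_words(topic_weights, feature_names, n_top):
    """Return the n_top (term, weight) pairs with the largest weights."""
    # Rank term positions by weight, largest first.
    ranked = sorted(range(len(topic_weights)),
                    key=lambda i: topic_weights[i],
                    reverse=True)
    return [(feature_names[i], topic_weights[i]) for i in ranked[:n_top]]

# Made-up vocabulary and topic-term weights for illustration only.
feature_names = ['subway', 'traffic', 'rent', 'beach']
topics = [
    [4.0, 0.5, 3.0, 0.1],
    [0.2, 5.0, 1.0, 2.5],
]

for idx, weights in enumerate(topics):
    print('Topic', idx, top_words(weights, feature_names, 2))
```

In the notebook, the highest-weighted term of each topic serves as the topic's label and the remaining top terms are listed as its subtopics.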

In [None]:
def why_lda_model(why_df):
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+', ngram_range=(1,3))

    # apply transformation; the sparse matrix is passed to LDA directly to save memory
    tf = vectorizer.fit_transform(why_df['corpora'])

    # tf_feature_names tells us what word each column in the matrix represents
    # (get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names())
    tf_feature_names = vectorizer.get_feature_names_out()

    number_of_topics = 30

    model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
    model.fit(tf)

    #creates a table of the topics and weight of the document
    def display_topics(model, feature_names, no_top_words):
        topic_dict = {}
        for topic_idx, topic in enumerate(model.components_):
            topic_dict['Topic %d words' % (topic_idx)]= ['{}'.format(feature_names[i])
                            for i in topic.argsort()[:-no_top_words - 1:-1]]
            topic_dict['Topic %d weights' % (topic_idx)]= ['{:.1f}'.format(topic[i])
                            for i in topic.argsort()[:-no_top_words - 1:-1]]
        return pd.DataFrame(topic_dict)

    no_top_words = 15
    return display_topics(model, tf_feature_names, no_top_words).T
    
whynyc_lda = why_lda_model(whynyc)   
whyla_lda = why_lda_model(whyla) 

In [None]:
def collect_topics(x, why_df):
    #builds a dataframe from the transposed topic table returned by why_lda_model
    topic_df = pd.DataFrame()

    #even rows hold topic words; the first word is used as the topic label
    topic_df['topic'] = x.iloc[::2, 0].reset_index(drop=True)
    #odd rows hold the matching weights, stored as formatted strings, so convert to float
    topic_df['weight'] = x.iloc[1::2, 0].reset_index(drop=True).astype(float)
    #joins the remaining words of each topic into a comma-separated string
    topic_df['subtopics'] = x.iloc[::2, 1:].apply(lambda row: ', '.join(row[row.notnull()]), axis = 1).reset_index(drop=True)

    #mean polarity score of all tweets whose text contains the topic word
    values = []
    for word in topic_df['topic']:
        matches = why_df.loc[why_df['text'].str.contains(word, case=False, regex=False)]
        values.append(matches['score'].mean())
    #topics with no matching tweets give NaN, which is treated as neutral (0)
    topic_df['sentiment'] = pd.Series(values).fillna(0)

    #categorize the sentiment based on score
    def sentiment_category(sentiment):
        if sentiment > 0:
            return 'positive'
        elif sentiment == 0:
            return 'neutral'
        else:
            return 'negative'

    topic_df['category'] = topic_df['sentiment'].apply(sentiment_category)
    #sort once at the end so that every column stays aligned with its topic
    topic_df = topic_df.sort_values(by=['weight'], ascending=False, ignore_index=True)
    return topic_df
      
whynyc_topics = collect_topics(whynyc_lda, whynyc)
whyla_topics = collect_topics(whyla_lda, whyla)

In [None]:
#to display all the items in the subtopics column
pd.set_option('display.max_colwidth', 0)

In [None]:
whynyc_topics

In [None]:
whyla_topics

# Topic Bubble Map

This data visualization creates a topic bubble map based on the results in the *Topic Modeling* section. Geopandas is used to generate a set of random latitude and longitude coordinates within the geographical boundaries of the area of interest. Plotly is used to graph everything together.
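The idea of placing bubbles at random positions inside a boundary can be sketched with rejection sampling: draw a point in the bounding box and keep it only if it falls inside the polygon. The version below is a simplified standard-library illustration that handles one simple polygon (a made-up triangle) rather than a GeoJSON file, and the helper names are hypothetical.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_points_in_polygon(polygon, k, seed=None):
    """Draw points in the bounding box until k of them fall inside the polygon."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < k:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Made-up triangular boundary; one point per topic for 30 topics.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = random_points_in_polygon(triangle, 30, seed=0)
print(len(points), points[:3])
```

Sampling until the required number of points is reached guarantees one coordinate per topic. Drawing a fixed number of candidates and filtering them afterwards can leave fewer points than topics when the boundary covers only a small part of its bounding box.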

In [None]:
def topic_map(file_name, why_df, export_name, title):
    #read GeoJSON file
    gdf_polys = gpd.read_file(file_name)

    # find the bounds of your geodataframe
    x_min, y_min, x_max, y_max = gdf_polys.total_bounds
    # set sample size
    n = 100
    # generate random data within the bounds
    x = np.random.uniform(x_min, x_max, n)
    y = np.random.uniform(y_min, y_max, n)

    # convert them to a points GeoSeries
    gdf_points = gpd.GeoSeries(gpd.points_from_xy(x, y))
    # only keep those points within polygons
    gdf_points = gdf_points[gdf_points.within(gdf_polys.unary_union)]
    
    #reset the index of both dataframes and add the lat and lon columns
    gdf_points = gdf_points.reset_index(drop=True)
    why_df = why_df.reset_index(drop=True)
    why_df['lon'] = gdf_points.geometry.apply(lambda p: p.x)
    why_df['lat'] = gdf_points.geometry.apply(lambda p: p.y)
    
    #create the geographical scatter plot
    fig = px.scatter_mapbox(
        why_df,
        lat=why_df['lat'],
        lon=why_df['lon'],
        zoom=8.5,
        hover_name=why_df['topic'],
        text = why_df.subtopics.apply(lambda txt: '<br>'.join(textwrap.wrap(txt, width=35))),
        width = 600,
        height = 700,
    )
    
    #set color of scatter points
    def SetColor(x):
        if(x < 0):
            return '#F4A259'
        elif(x == 0):
            return '#847979'
        elif(x > 0):
            return '#2A9D8F'

    #set scatter points
    fig.update_traces(
        mode='markers+text',
        
        marker=dict(
            color= list(map(SetColor, why_df['sentiment'])),
            size=why_df['weight'].astype(float)/7,
        ),
        
        showlegend=True,
        hovertemplate= '<b>Topic: '+ why_df.topic + '</b><br><br>' 
            + 'The associated words with this topic are: <br>%{text}<br><br>'
            + 'The overall sentiment is '
            + why_df.category + '.',
    )

    #update and customize map
    fig.update_layout(
        mapbox = {
            'style': 'carto-positron',
            'layers': [
                {
                'source': file_name,
                'type': 'fill',
                    'below': 'traces',
                    'color': '#111954',
                    'opacity': 0.4,
                    'line': {'width': 5}
                } 
            ]},
        hoverlabel=dict(
            bgcolor='white',
            bordercolor='white',
            font=dict(color='black'),
            font_size=12, 
            font_family='Helvetica', 
            align='left'
        ),
        legend_title_text = '<b>'+ title + '</b>',
        legend=dict(
            orientation='h',
            yanchor='top',
            y=0.99,
            xanchor='left',
            x=0.01
        ),

    )
    
    #write Plotly graph to an HTML file
    fig.write_html(export_name, full_html=False, include_plotlyjs='cdn')

    fig.show()
    return fig

nyc_map = topic_map('nyc.geojson', whynyc_topics, 'nyc.html', 'Why, New York City?')
la_map = topic_map('la.geojson', whyla_topics, 'la.html', 'Why, Los Angeles?')

# Accuracy of Sentiment Analysis

The purpose of this section is to determine whether or not the methodology behind the sentiment analysis is accurate, using Logistic Regression and Multinomial Naive Bayes from the sklearn library.

As a recap, the NLTK and TextBlob libraries were used to get the subjectivity and polarity scores of tweets. NLTK's sentiment analyzer does not return a subjectivity score, so subjectivity comes from TextBlob alone, while the polarity score is the mean of the NLTK and TextBlob polarity scores. This score was then converted into 'positive' (scores above 0), 'negative' (scores below 0), and 'neutral' (scores equal to 0) labels. A manual look through the dataset shows that some tweets appear to be miscategorized.
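When reading the evaluation output, it helps to remember that accuracy is symmetric in its two arguments while precision and recall are not. This is why scikit-learn's `classification_report` and `confusion_matrix` take the true labels first and the predictions second. The sketch below demonstrates this on made-up labels with the standard library; the helper functions are illustrative and not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the two label lists agree."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times the class was predicted
    actual = sum(1 for t in y_true if t == label)     # times the class truly occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up labels for illustration only.
y_true = ['positive', 'positive', 'negative', 'neutral', 'negative']
y_pred = ['positive', 'negative', 'negative', 'negative', 'negative']

print('accuracy:', accuracy(y_true, y_pred), accuracy(y_pred, y_true))
print('correct order:', precision_recall(y_true, y_pred, 'negative'))
print('swapped order:', precision_recall(y_pred, y_true, 'negative'))
```

Swapping the arguments leaves accuracy unchanged but exchanges precision and recall. Note also that the labels being predicted here come from the lexicon-based scores themselves, so a high accuracy shows that a classifier can reproduce those labels from the tweet text.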

In [None]:
def model_accuracy(df_name, why_df):
    X_train, X_test, y_train, y_test = train_test_split(
        why_df['text'], 
        why_df['sentiment'], 
        test_size=0.2, 
        random_state=24)
    
    vectorizer = TfidfVectorizer(ngram_range=(1,3), stop_words='english')
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)
    
    #Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    
    predictions = lr.predict(X_test)
    #scikit-learn metrics expect the true labels first, then the predictions
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
    
    accuracy = metrics.accuracy_score(y_test, predictions)
    print('For the {} dataset, the accuracy for Logistic Regression with TfidfVectorizer is {:04.2f}%'.format(df_name, accuracy*100))
    
    #Naive Bayes (Multinomial)
    mnb = MultinomialNB()
    mnb.fit(X_train, y_train)
    predictions_mnb = mnb.predict(X_test)
    print(confusion_matrix(y_test, predictions_mnb))
    print(classification_report(y_test, predictions_mnb))
    accuracy_mnb = metrics.accuracy_score(y_test, predictions_mnb)
    
    print('For the {} dataset, the accuracy for Multinomial Naive Bayes with TfidfVectorizer is {:04.2f}%'.format(df_name, accuracy_mnb*100))

model_accuracy('whynyc', whynyc)
model_accuracy('whyla', whyla)