Hello, this is a small notebook that runs a sentiment analysis and then plots a map of support for each presidential candidate. It is based on my other code about the [Australian Elections with Tweets](https://www.kaggle.com/andreispurim/challenge-botando-pra-quebrar) (which yielded pretty good results). I just copied and adjusted the code, but this time it didn't work quite as well.

So, here's the idea: train a TF-IDF/LinearSVC model on sentiment140's 1,600,000 tweets (which are labeled positive or negative), then apply the fitted model to the tweets about Biden and Trump and plot the map.
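
As a rough illustration of that train-then-apply flow, here is a minimal sketch on a handful of made-up sentences. The toy texts and the variable names are assumptions for illustration only, and `make_pipeline` is used here just to bundle the vectorizer and the classifier; the notebook itself keeps them separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for sentiment140 (made-up sentences, not real data).
train_texts = ["i love this so much", "what a great day",
               "this is awful", "i hate waiting"]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF over unigrams and bigrams feeding a linear SVM, as in the notebook.
sentiment_pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
sentiment_pipeline.fit(train_texts, train_labels)

# Apply the fitted model to unseen text (stand-ins for the candidate tweets).
candidate_tweets = ["great rally today", "awful debate"]
predictions = sentiment_pipeline.predict(candidate_tweets)
print(list(predictions))
```

With only four training sentences the predictions mean very little; the point is just the shape of the workflow: fit once on labeled tweets, then call `predict` on the election tweets.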

I promise I'll document the code a little better later, but it should be clear enough to give you some idea of what's going on. If you'd like to use it and improve it, please go ahead.

Now, a few ideas on improving this code:
- The idea of using sentiment140 is quite sound, but instead of using LinearSVC you can use much **better models**.
- The `clear_sentence` function is designed to be fast, but it is not the **best cleaner**. It can be improved, especially considering that a good chunk of the words in tweets are @mentions and #hashtags. It would be a great improvement to keep mentions and hashtags as separate entities.
- **Removing stopwords?** I don't think it works well with so few words per tweet: the current accuracy on sentiment140 is 82%, and without the stopwords it fell to 77%, probably because the stopwords help the model understand the text.
- Maybe concatenate the Biden and Trump tweets into the same dataset. This might speed things up a little.
- Sentiment140 used to have a 'neutral' label. If our model were fitted with it, maybe the map would look better.
- Speaking of the map, it would also be smart to 'normalize' the colors. To balance Trump's orange (which is plotted on top of Biden's points) I had to lower its alpha. Maybe the best approach would be to average the sentiment in each region.
- Also, remember the demographic distribution of the USA, and remember that Twitter is usually used by younger, more liberal college students rather than by other demographics. That's why there's so much support for Biden in the Midwest: the only people using Twitter in those regions are young Biden voters.
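
Two of the ideas above (keeping mentions and hashtags, and averaging per region) can be sketched with the standard library alone. Everything named here (`TOKEN_PATTERN`, `clear_sentence_keep_entities`, `net_support_by_cell`, the +1/-1 scoring and the one-degree cell size) is a hypothetical choice for illustration, not something the notebook already does.

```python
import math
import re
from collections import defaultdict

# Hypothetical token pattern: @mentions and #hashtags stay whole,
# everything else is reduced to runs of word characters.
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+")

def clear_sentence_keep_entities(sentence: str) -> str:
    """Lowercase a tweet and drop punctuation, keeping @mentions and #hashtags intact."""
    return " ".join(TOKEN_PATTERN.findall(sentence.lower()))

def net_support_by_cell(points, cell_size=1.0):
    """Average a score per latitude/longitude grid cell.

    `points` is an iterable of (lat, long, score) tuples; in this sketch the
    score is assumed to be +1 for a Biden-positive tweet and -1 for a
    Trump-positive tweet. Returns {(lat_cell, long_cell): mean score}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for lat, lon, score in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        totals[cell][0] += score
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}

cleaned = clear_sentence_keep_entities("Vote for @JoeBiden! #Election2020, now.")
cells = net_support_by_cell([
    (40.2, -74.1, 1), (40.7, -74.9, 1), (40.5, -74.5, -1),  # same one-degree cell
    (34.0, -118.2, -1),
])
print(cleaned)
print(cells)
```

Note that `TfidfVectorizer`'s default `token_pattern` only keeps runs of two or more word characters, so the `@` and `#` prefixes would still be dropped unless the vectorizer is given a matching pattern (for example `token_pattern=r"[@#]?\w\w+"`). The per-cell averages could then be drawn as a single color scale instead of two overlapping scatter layers.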


In [None]:
import matplotlib.pyplot
import seaborn
import pandas
import string
import numpy
import nltk
import time
import gc
%matplotlib inline

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from mpl_toolkits.basemap import Basemap
from sklearn.svm import LinearSVC
from collections import Counter

def clear_sentence(sentence: str) -> str:
    sentence = sentence.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    sentence = sentence.lower()
    return sentence

def train_our_model_in_tweets():
    # Load the data (the CSV has no header row, so we name the columns ourselves)
    Sentiments = pandas.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", names=["target", "ids", "date", "flag", "user", "text"])

    # We only need the text and the sentiment label; drop the other columns.
    Sentiments = Sentiments[['target','text']]

    # Make the sentiments strings
    sentiment_value = {0: "negative", 2: "neutral", 4: "positive"}
    decode = lambda label: sentiment_value[int(label)]
    x = Sentiments['text'].apply(clear_sentence).tolist()
    y = Sentiments['target'].apply(decode).tolist()

    # Let's use the SVC model we used before.
    starting_time = time.time()   
    vector = TfidfVectorizer(ngram_range=(1, 2))
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    X_training = vector.fit_transform(X_train) 
    X_testing = vector.transform(X_test)
    model = LinearSVC()
    model.fit(X_training, y_train)
    y_prediction = model.predict(X_testing)
    accuracy = accuracy_score(y_test, y_prediction)
    ending_time = time.time()
    print('Trained our model on', len(X_train), 'of', len(Sentiments.index), 'tweets')
    print("Accuracy: {:.2f}% in {:.2f}s".format(accuracy*100, ending_time-starting_time))
    return model,vector

def plot_support(model,vector):
    # Get the dataset
    Trump = pandas.read_csv('../input/us-election-2020-tweets/hashtag_donaldtrump.csv', lineterminator='\n')
    Biden = pandas.read_csv('../input/us-election-2020-tweets/hashtag_joebiden.csv', lineterminator='\n')
    Trump = Trump[['tweet','lat','long']]
    Biden = Biden[['tweet','lat','long']]

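    # Drop rows with missing text first: assigning the result of .dropna().apply()
    # back to the column would leave NaN in those rows and break vector.transform later.
    Trump = Trump.dropna(subset=['tweet']).copy()
    Biden = Biden.dropna(subset=['tweet']).copy()

    # And let's clean our tweets
    Trump['tweet'] = Trump['tweet'].apply(clear_sentence)
    Biden['tweet'] = Biden['tweet'].apply(clear_sentence)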

    # A function to get only the data inside the USA (there are many tweets from abroad)
    def get_region(data, bot_lat, top_lat, left_lon, right_lon):
        top = data.lat <= top_lat
        bot = data.lat >= bot_lat
        left = data.long >= left_lon
        right = data.long <= right_lon
        index = top&bot&left&right 
        return data[index]

    Trump = get_region(Trump,24,50,-126,-65)
    Biden = get_region(Biden,24,50,-126,-65)

    trump_sentiment = pandas.DataFrame(model.predict(vector.transform(Trump['tweet'].tolist())),columns=['sentiment'])
    biden_sentiment = pandas.DataFrame(model.predict(vector.transform(Biden['tweet'].tolist())),columns=['sentiment'])

    Trump = pandas.concat([Trump.reset_index(drop=True), trump_sentiment], axis=1)
    Biden = pandas.concat([Biden.reset_index(drop=True), biden_sentiment], axis=1)

    trump_positive = Trump[Trump['sentiment'] == 'positive']
    biden_positive = Biden[Biden['sentiment'] == 'positive']

    Map = Basemap(llcrnrlat=24,urcrnrlat=50,llcrnrlon=-126,urcrnrlon=-65)
    matplotlib.pyplot.figure(figsize=(12,10))
    Map.bluemarble(alpha=0.6)

    seaborn.scatterplot(x='long', y='lat', data=biden_positive, linewidth=0, s=40, alpha=1, label='Support for Biden')
    seaborn.scatterplot(x='long', y='lat', data=trump_positive, linewidth=0, s=40, alpha=0.01, label='Support for Trump')

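    # Make the Trump legend marker opaque (the points themselves use alpha=0.01).
    # Matplotlib 3.7 renamed `legendHandles` to `legend_handles`, so support both names.
    legend = matplotlib.pyplot.gca().get_legend()
    handles = getattr(legend, 'legend_handles', None) or legend.legendHandles
    handles[1].set_alpha(1)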
    matplotlib.pyplot.title("Tweets positive towards presidential candidates")
    matplotlib.pyplot.show()
    
trained_model,fitted_vector = train_our_model_in_tweets()
plot_support(trained_model,fitted_vector)