<a href="https://colab.research.google.com/github/sucharith-p/ColabNotebooks/blob/master/Text_Analysis_Tweets_from_Elon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Text Analysis - Tweets from Elon (Years 2021 and 2022) 


---


##Introduction

The objective of this project is to analyse word frequencies and the most common word associations in Elon Musk's Twitter feed.

**Text Analysis** is the process of using computer systems to read and understand human-written text for business insights. You can use text analysis to efficiently and accurately process multiple text-based sources such as emails, documents, social media content, and product reviews, like a human would.

I conclude the analysis with a network graph of the bigrams (pairs of adjacent words) that appear most often in Elon's tweets during each year.

##Next Steps:

###1. Data Retrieval

The data is available at https://www.kaggle.com/datasets/ayhmrba/elon-musk-tweets-2010-2021. I have chosen the years 2021 and 2022 for my analysis. Other years' data can also be included for further operations like sentiment analysis and text mining. 

###2. Data Preprocessing
The raw data must be pre-processed to remove or impute null and faulty values. Outliers should be validated and any other anomalies handled in this step. In this notebook, that means stripping mentions, hashtags, links, punctuation, and numbers from each tweet, and dropping tweets that end up empty. I have also removed stop words (words that serve a grammatical role rather than conveying important information).
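
As a rough illustration of stop-word removal, the sketch below filters a sentence against a tiny hand-picked stop-word set. The set `STOP_WORDS` and the helper `remove_stop_words` are assumptions made for this example; the notebook itself relies on nltk's English stop-word list and `word_tokenize`.

```python
# A tiny, hand-picked stop-word set used only for illustration (an assumption);
# the notebook uses nltk's much larger English stop-word list instead.
STOP_WORDS = {"the", "is", "a", "to", "of", "and", "in", "it"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

sample = "The history of the market is a guide to the next recession"
print(remove_stop_words(sample))
# ['history', 'market', 'guide', 'next', 'recession']
```

Only the content-bearing words survive, which is what makes the later frequency counts meaningful.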


###3. Checking the distribution of data
We plot a simple histogram for the word frequency and check the nature of the distribution.
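
To make concrete what such a histogram bins, here is a minimal sketch on a made-up corpus (the `tweets` list is hypothetical, not taken from the dataset). It counts how many distinct words occur exactly once, twice, and so on.

```python
from collections import Counter

# Hypothetical, already-cleaned tweets (not from the dataset).
tweets = ["tesla fsd beta", "tesla great", "great progress tesla", "starlink beta"]

# Frequency of each word across all tweets.
word_counts = Counter(word for tweet in tweets for word in tweet.split())

# Number of distinct words that occur exactly k times: this is what a
# histogram over the 'Frequency' column summarises.
count_of_counts = Counter(word_counts.values())

for k in sorted(count_of_counts):
    print(f"{count_of_counts[k]} word(s) occur {k} time(s)")
```

Word-frequency data usually shows many rare words and a few very frequent ones, so a heavily right-skewed histogram would not be surprising.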


###4. Visualizing the data
We build a dataframe of the most common bigrams, extracted with the nltk package, and draw them as a network graph.



---


###Steps to Mount your data onto the notebook: 
You can either upload the data directly to the runtime's session storage, or mount your Google Drive, upload the files there, and access them from the notebook.
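
Since the files can live in either place, a small lookup helper can make the notebook independent of which option you chose. This is only a sketch: `find_data_file` is a hypothetical helper, and the candidate directories are assumptions based on the paths used in this notebook.

```python
from pathlib import Path

def find_data_file(filename, search_dirs):
    """Return the first existing path to `filename` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# Candidate locations: the session's working directory, then the mounted Drive.
candidates = [".", "/content/drive/My Drive"]
path = find_data_file("2021.csv", candidates)
print(path if path else "2021.csv not found; upload it or mount Drive first")
```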

In [1]:
#Mounting my drive onto the notebook

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#Navigating to the folder where I have uploaded the files

%cd /content/drive/My Drive

/content/drive/My Drive


In [18]:
#Import Statements

import pandas as pd
import nltk 
import re
import dateutil.parser  #the parser submodule is used by format_year below
import matplotlib.pyplot as plt
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk import bigrams
import itertools
import networkx as nx
import collections
from nltk.probability import FreqDist
import plotly.express as px
import plotly.graph_objects as go

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
#Read the data as dataframe 

df1 = pd.read_csv('2021.csv')
df2 = pd.read_csv('2022.csv')
df = pd.concat([df1, df2])

In [5]:
#Functions for Data Pre-processing

def preprocess(tw):

    stop_words = set(stopwords.words('english'))
    #Removing any account references and symbols or hashtags
    tw = re.sub("[@&#][A-Za-z0-9_]+"," ", tw.lower())
    
    
    #Removing hyperlinks and web links
    tw = re.sub(r'http\S+', ' ', tw)
    tw = re.sub(r'www\.\S+', ' ', tw)  #escape the dot so only a literal "www." matches
    
    #Removing Punctuation
    tw = re.sub(r'[()!?]', ' ', tw)
    tw = re.sub(r'\[.*?\]', ' ', tw)  #raw string avoids invalid-escape warnings
    
    #Removing any other special characters
    tw = re.sub("[^a-z0-9]"," ", tw)
    
    #Removing numbers
    tw = re.sub(r'[0-9]+', ' ', tw)
    
    #Word tokenization and removing stop words
    tokens = word_tokenize(tw)
    filtered_sentence = [w for w in tokens if w not in stop_words]
    
    if len(filtered_sentence)==0:
        return float('NaN')
    else:
        return ' '.join(filtered_sentence)

#Function to format date to year
def format_year(date):
    new_date = dateutil.parser.parse(date)
    return new_date.year

#Frequency Distribution
def frequency_dist(document):
    return nltk.FreqDist(document)

#Tokenization
def tokenize(string):
    return word_tokenize(string)
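
To see what the regex steps in `preprocess` do to a single tweet, the sketch below replays them on a made-up example using only the standard library. The helper `clean` and the sample tweet are assumptions for illustration; unlike `preprocess`, it splits on whitespace and does not remove stop words.

```python
import re

def clean(tw):
    """Replay the notebook's regex steps, then collapse whitespace (no nltk needed)."""
    tw = re.sub(r"[@&#][A-Za-z0-9_]+", " ", tw.lower())  # mentions, hashtags, &entities
    tw = re.sub(r"http\S+", " ", tw)                      # hyperlinks
    tw = re.sub(r"www\.\S+", " ", tw)                     # bare web links
    tw = re.sub(r"[()!?]", " ", tw)                       # some punctuation
    tw = re.sub(r"\[.*?\]", " ", tw)                      # bracketed text
    tw = re.sub(r"[^a-z0-9]", " ", tw)                    # any other special characters
    tw = re.sub(r"[0-9]+", " ", tw)                       # numbers
    return " ".join(tw.split())

sample = "@SpaceX Starship launch in 3 days!! https://t.co/abc123 #Mars [video]"
print(clean(sample))  # starship launch in days
```

In the full `preprocess` function, the stop word "in" would also be dropped after tokenization.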

In [6]:
#Applying Data pre-processing functions to the dataframe

df['tweet'] = df['tweet'].apply(preprocess)

df['year'] = df['date'].apply(format_year)

df = df[df['tweet'].notna()]

df 

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,year
1,1476656306610216960,1476644467578859528,2021-12-31 00:47:53 Arabian Standard Time,2021-12-31,00:47:53,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'tesla_raj', 'name': 'Tesla R...",,,,,2021
2,1476651519986614281,1476252898115964928,2021-12-31 00:28:51 Arabian Standard Time,2021-12-31,00:28:51,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'CSmithson80', 'name': 'Chris...",,,,,2021
3,1476619907076923398,1476252898115964928,2021-12-30 22:23:14 Arabian Standard Time,2021-12-30,22:23:14,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'BLKMDL3', 'name': 'Zack', 'i...",,,,,2021
4,1476618021024190474,1476252898115964928,2021-12-30 22:15:45 Arabian Standard Time,2021-12-30,22:15:45,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'mims', 'name': 'Christopher ...",,,,,2021
6,1476473842059161602,1476437555717541893,2021-12-30 12:42:50 Arabian Standard Time,2021-12-30,12:42:50,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'T_Ball5', 'name': 'TBALL5', ...",,,,,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,1478074724198633476,1478069658687266818,2022-01-03 22:44:10 Arabian Standard Time,2022-01-03,22:44:10,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'jack', 'name': 'jack⚡️', 'id...",,,,,2022
1021,1478071525660147720,1477989974008020992,2022-01-03 22:31:27 Arabian Standard Time,2022-01-03,22:31:27,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'ClaudioOmbrella', 'name': 'C...",,,,,2022
1022,1477836846218592259,1477636317831966723,2022-01-03 06:58:55 Arabian Standard Time,2022-01-03,06:58:55,400,44196397,elonmusk,Elon Musk,,...,,,,,"[{'screen_name': 'auren', 'name': 'Auren 𝐇𝐨𝐟𝐟𝐦...",,,,,2022
1024,1477706142461706248,1477706142461706248,2022-01-02 22:19:33 Arabian Standard Time,2022-01-02,22:19:33,400,44196397,elonmusk,Elon Musk,,...,,,,,[],,,,,2022


In [7]:
#Removing all other irrelevant columns

final_df = df.copy(deep = True)
final_df = final_df[['tweet', 'year']]
final_df 

Unnamed: 0,tweet,year
1,many ui improvements coming,2021
2,chart big deal,2021
3,predicting macroeconomics challenging say leas...,2021
4,history guide many make past next recession,2021
6,probably wrong,2021
...,...,...
1020,reminds hex edited ultima v get final maze,2022
1021,yay switzerland,2022
1022,way touch voters three generations away voting...,2022
1024,let make roaring happen,2022


In [8]:
#Top 10 Words for year 2021

freq_2021= frequency_dist(word_tokenize(' '.join(final_df[final_df['year']==2021]['tweet'])))
df_2021 = pd.DataFrame(freq_2021.items(),columns =['Word','Frequency']).sort_values(['Frequency'],ascending =False)
df_2021.head(10)

Unnamed: 0,Word,Frequency
48,tesla,213
65,great,102
143,good,98
111,much,93
156,would,89
96,haha,89
117,like,89
32,time,86
807,beta,77
927,high,74


In [9]:
#Top 10 Words for year 2022

freq_2022= frequency_dist(word_tokenize(' '.join(final_df[final_df['year']==2022]['tweet'])))
df_2022 = pd.DataFrame(freq_2022.items(),columns =['Word','Frequency']).sort_values(['Frequency'],ascending =False)
df_2022.head(10)

Unnamed: 0,Word,Frequency
63,tesla,62
54,people,34
60,would,32
107,yes,32
245,one,30
323,good,28
7,starlink,28
158,car,26
686,true,22
134,high,22
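
The two top-10 lists can be compared directly. The sketch below copies the words from the output tables above and checks which ones appear in both years. It compares membership rather than counts, because the raw frequencies are on different scales (for example, 213 versus 62 for "tesla").

```python
# Top-10 words per year, copied from the output tables above.
top_2021 = ["tesla", "great", "good", "much", "would", "haha", "like", "time", "beta", "high"]
top_2022 = ["tesla", "people", "would", "yes", "one", "good", "starlink", "car", "true", "high"]

# Words present in both lists, kept in their 2021 rank order.
shared = [w for w in top_2021 if w in top_2022]
only_2021 = [w for w in top_2021 if w not in top_2022]
only_2022 = [w for w in top_2022 if w not in top_2021]

print("Both years:", shared)     # ['tesla', 'good', 'would', 'high']
print("2021 only:", only_2021)
print("2022 only:", only_2022)
```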


In [11]:
#Histogram for checking distribution of data for the year 2021

fig = px.histogram(df_2021, x="Frequency")
fig.show()

In [13]:
#Histogram for checking distribution of data for the year 2022

fig = px.histogram(df_2022, x="Frequency")
fig.show()

###Note - 
The `tokenize` function is then applied to each cleaned tweet to split it into word tokens, so that adjacent word pairs (bigrams) can be counted.

Hover over a node in the bigram plot to see its word; edges connect words that appear next to each other in one of the 50 most common bigrams.
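
Bigrams are computed per tweet rather than over one long joined string, so that the last word of one tweet is never paired with the first word of the next. A minimal sketch of the difference, where `bigrams_of` is a hypothetical stand-in for `nltk.bigrams` built on `zip`:

```python
def bigrams_of(tokens):
    """Adjacent token pairs, similar in spirit to nltk.bigrams."""
    return list(zip(tokens, tokens[1:]))

# Two hypothetical tokenized tweets.
tweets = [["fsd", "beta"], ["long", "term"]]

# Per-tweet bigrams: pairs never cross a tweet boundary.
per_tweet = [pair for tokens in tweets for pair in bigrams_of(tokens)]

# Bigrams over the joined token stream: a spurious cross-tweet pair appears.
joined = bigrams_of([tok for tokens in tweets for tok in tokens])

print(per_tweet)  # [('fsd', 'beta'), ('long', 'term')]
print(joined)     # [('fsd', 'beta'), ('beta', 'long'), ('long', 'term')]
```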

In [15]:
#Tokenize tweets and add to dataframe

final_df['Tokens'] = final_df['tweet'].apply(tokenize)

In [118]:
#Create bigram dataframe

bigram_words = [list(bigrams(tweet)) for tweet in final_df[final_df['year']==2021]['Tokens']]
total_bigrams = list(itertools.chain(*bigram_words))
bigram_counts = collections.Counter(total_bigrams)
bigram_df = pd.DataFrame(bigram_counts.most_common(50), columns=['bigram', 'count'])
bigram_df

Unnamed: 0,bigram,count
0,"(fsd, beta)",21
1,"(long, term)",19
2,"(pure, vision)",16
3,"(life, multiplanetary)",16
4,"(self, driving)",16
5,"(next, week)",15
6,"(next, year)",14
7,"(super, heavy)",13
8,"(pretty, much)",13
9,"(supply, chain)",13
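
Before drawing the graph, it can help to see how bigram rows translate into nodes and edges. The sketch below uses only the ten rows shown above and a plain dictionary of neighbours instead of networkx; the names `top_bigrams`, `neighbors`, and `hubs` are chosen for this example.

```python
from collections import defaultdict

# The ten 2021 bigrams shown in the output above, with their counts.
top_bigrams = {
    ("fsd", "beta"): 21, ("long", "term"): 19, ("pure", "vision"): 16,
    ("life", "multiplanetary"): 16, ("self", "driving"): 16,
    ("next", "week"): 15, ("next", "year"): 14, ("super", "heavy"): 13,
    ("pretty", "much"): 13, ("supply", "chain"): 13,
}

# Each bigram becomes an undirected edge between its two words.
neighbors = defaultdict(set)
for left, right in top_bigrams:
    neighbors[left].add(right)
    neighbors[right].add(left)

# Words linked to more than one other word act as small hubs in the graph.
hubs = {word: sorted(linked) for word, linked in neighbors.items() if len(linked) > 1}

print(len(neighbors), "nodes,", len(top_bigrams), "edges")  # 19 nodes, 10 edges
print(hubs)  # {'next': ['week', 'year']}
```

With all 50 bigrams, more words are likely to share neighbours, so the real graph can be more connected than this ten-row excerpt suggests.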


In [119]:
#Transfer data to a dictionary 

bigram_dictonary = bigram_df.set_index('bigram').T.to_dict('records')

In [120]:
#Initialize graph instance and initialize nodes and edges

G = nx.Graph()

#Each bigram (word1, word2) becomes an edge; networkx adds both nodes automatically
for k, v in bigram_dictonary[0].items():
    G.add_edge(k[0], k[1])

pos = nx.shell_layout(G) 

for n, p in pos.items():
  G.nodes[n]['pos'] = p

node_text = []
for i in G.nodes():
  node_text.append(i)

In [124]:
#Create a network graph using Plotly graph objects

edge_x = []
edge_y = []

for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines')

node_x = []
node_y = []

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_x.append(x)
    node_y.append(y)

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=False,
        # reversescale=True,
        color=[],
        size=10,
        line_width=2))

node_trace.text = node_text 

In [125]:
#Display the bigram network graph for 2021

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )
fig.show()

In [130]:
#Create bigram dataframe

bigram_words = [list(bigrams(tweet)) for tweet in final_df[final_df['year']==2022]['Tokens']]
total_bigrams = list(itertools.chain(*bigram_words))
bigram_counts = collections.Counter(total_bigrams)
bigram_df = pd.DataFrame(bigram_counts.most_common(50), columns=['bigram', 'count'])
bigram_df

Unnamed: 0,bigram,count
0,"(birth, rate)",8
1,"(model, x)",8
2,"(sustainable, energy)",6
3,"(work, tesla)",6
4,"(self, driving)",6
5,"(neural, nets)",6
6,"(last, year)",6
7,"(life, multiplanetary)",6
8,"(starlink, terminals)",4
9,"(news, sources)",4


In [131]:
#Transfer data to dictionary 

bigram_dictonary = bigram_df.set_index('bigram').T.to_dict('records')

In [132]:
#Initialize graph instance and initialize nodes and edges

G = nx.Graph()

#Each bigram (word1, word2) becomes an edge; networkx adds both nodes automatically
for k, v in bigram_dictonary[0].items():
    G.add_edge(k[0], k[1])

pos = nx.shell_layout(G) 

for n, p in pos.items():
  G.nodes[n]['pos'] = p

node_text = []
for i in G.nodes():
  node_text.append(i)

In [133]:
#Create a network graph using Plotly graph objects

edge_x = []
edge_y = []

for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines')

node_x = []
node_y = []

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_x.append(x)
    node_y.append(y)

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=False,
        color=[],
        size=10,
        line_width=2))

node_trace.text = node_text 

In [134]:
#Display the bigram network graph for 2022

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )
fig.show()

##Conclusion

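This analysis identified the most common words in Elon's tweets from 2021 and 2022, along with the most common word associations (bigrams) in those tweets. "tesla" is the most frequent word in both years, and bigrams such as (self, driving) and (life, multiplanetary) appear among the most common pairs in both years.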