Link to [Colab Notebook](https://colab.research.google.com/drive/17nwWFe478Lc0-xzrMW3lVnwCTCNja5vJ?usp=sharing)
Link to [GitHub Repo](https://github.com/vidyap-xgboost/DataScience-ML_Projects/blob/master/twitter_data_twint_sweetviz_texthero.ipynb)

Please don't forget to **Upvote** this notebook if you find it useful and learn something from it! That would really encourage me to keep writing more notebooks. Thank you!


# 0. Scraping Data using Twint

Let's collect data from Twitter using the [twint](https://github.com/twintproject/twint) library.

**Question 1:** Why are we using **twint** instead of **Twitter's Official API**?

**Ans:** Because twint scrapes Twitter without authentication and without the official API, and, importantly, it is not bound by the API's rate limits.

**For more ways to install this library, please visit the link above.**

```python
!pip3 install twint
```

```python
import twint

# Create a function to scrape a user's account.
def scrape_user():
    print("Fetching Tweets")
    c = twint.Config()
    # account to scrape (I used a different account for this project and
    # changed the username here to protect the user's privacy)
    c.Username = input('Username: ')
    # only fetch tweets posted after this time (narrows the results)
    c.Since = input('Date (format: "%Y-%m-%d %H:%M:%S"): ')
    # write the results to a CSV file instead of only printing them
    c.Store_csv = True
    # name of the file to save the results to
    c.Output = input('File name: ')
    twint.run.Search(c)
```

```python
# run the above function

scrape_user()
print('Scraping Done!')
```
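
Since `c.Since` is read from free-form input, a typo in the date only shows up once scraping starts. The sketch below checks the string against the format shown in the prompt before it is handed to twint. It uses only the standard library, and `parse_since` is a hypothetical helper name, not part of twint.

```python
from datetime import datetime

SINCE_FORMAT = "%Y-%m-%d %H:%M:%S"  # the format shown in the input prompt above

def parse_since(text):
    """Return a normalised timestamp string, or raise ValueError if it is malformed."""
    parsed = datetime.strptime(text.strip(), SINCE_FORMAT)
    return parsed.strftime(SINCE_FORMAT)

# Hypothetical usage: validate first, then assign the result to c.Since.
print(parse_since("2015-01-01 00:00:00"))
```

You could wrap the call in a `try`/`except ValueError` block and ask for the date again when it fails.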

# 1. Reading Data using Pandas

In [None]:
# pandas to read our csv file
import pandas as pd

In [None]:
# save the csv file into a dataframe 'df'
# parse_dates=[['date', 'time']] merges the two columns into a single 'date_time' column
df = pd.read_csv('../input/elon-musk-tweets-2015-to-2020/elonmusk.csv', low_memory=False, parse_dates=[['date', 'time']])
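
Recent pandas releases have deprecated passing a nested list to `parse_dates`, so the call above may emit a warning or stop working depending on your version. A minimal alternative, assuming the CSV has plain `date` and `time` text columns as in the twint output, is to read the file normally and build `date_time` yourself. The tiny in-memory CSV below is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Tiny stand-in for the scraped CSV; the real file has many more columns.
sample_csv = io.StringIO(
    "date,time,tweet\n"
    "2020-05-01,10:15:00,first tweet\n"
    "2020-05-02,23:59:59,second tweet\n"
)
sample = pd.read_csv(sample_csv)

# Build the same 'date_time' column explicitly, then drop the two source columns.
sample["date_time"] = pd.to_datetime(sample["date"] + " " + sample["time"])
sample = sample.drop(columns=["date", "time"])
print(sample.dtypes)
```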

In [None]:
# make a deep copy if you need one, so that later changes to the original df don't affect the copy
df_copy = df.copy(deep=True)

In [None]:
# check the whole df
display(df)

# check an overview of the df (info() prints its output directly and returns None)
df.info()

# quick summary statistics; notice the max and min of retweets_count and so on
display(df.describe())

In [None]:
# I don't need these columns, so dropping them. You can keep them if you want.
drop_list = ['id','conversation_id','created_at','name','timezone','user_id','cashtags','place','quote_url','near','geo','source','user_rt_id','user_rt','retweet_id','retweet_date','translate','trans_src','trans_dest','video','retweet']
df = df.drop(columns=drop_list)

In [None]:
# have a look again (info() prints its output directly and returns None)
df.info()

In [None]:
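# just in case texthero can't remove URLs
# raw string avoids invalid-escape warnings; regex=True is required in newer pandas
df['tweet'] = df['tweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)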
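
To see what this kind of pattern removes, here is a standalone sketch using only the `re` module. It matches anything that starts with `http` or `www.` up to the next whitespace, ignoring case. `strip_urls` and the sample sentence are made up for illustration, and the whitespace clean-up is an extra step not present in the cell above.

```python
import re

# Anything starting with "http" or "www." up to the next whitespace, case-insensitive.
URL_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

print(strip_urls("Launch today https://t.co/abc123 see www.example.com/live for more"))
```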

In [None]:
df

# 2. Install and Import TextHero

Here, we will be using [TextHero](https://github.com/jbesomi/texthero), a Python package for working efficiently and quickly with text data. You can think of TextHero as scikit-learn for text-based datasets.

**Question 2:** Why are we using TextHero instead of doing it from scratch using libraries like Gensim or other tools?

**Ans:** TextHero automates the cleaning process with a single method, which is pretty effective. If the text needs further cleaning, we can write code to remove the remaining unwanted words ourselves. Under the hood it uses libraries like spaCy, Gensim, tqdm, regex, and NLTK, so you don't have to import all of those separately when you use TextHero.

In [None]:
# Check the above link for other installation instructions.

!pip install texthero

In [None]:
# import texthero

import texthero as hero

# 3. TextHero for quick cleaning of **raw** text data

In [None]:
# let's do text preprocessing
from texthero import preprocessing

# creating a custom pipeline to preprocess the raw text we have
custom_pipeline = [preprocessing.fillna
                   , preprocessing.lowercase
                  #  , preprocessing.remove_digits # you can uncomment this if you want to remove digits as well.
                   , preprocessing.remove_punctuation
                   , preprocessing.remove_diacritics
                   , preprocessing.remove_stopwords
                   , preprocessing.remove_whitespace
                   , preprocessing.stem]

# simply call clean() method to clean the raw text in 'tweet' col and pass the custom_pipeline to pipeline argument
df['clean_tweet'] = hero.clean(df['tweet'], pipeline = custom_pipeline)
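
Conceptually, a pipeline is just a list of functions applied one after another, each receiving the output of the previous step. The plain-Python sketch below illustrates that idea on a single string. It is not TextHero's implementation: the step functions, the tiny stopword set, and the `clean` helper are simplified stand-ins written for this example.

```python
import string
from functools import reduce

TINY_STOPWORDS = {"the", "is", "a", "to", "of"}  # illustrative only, not TextHero's list

def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    # Replace each punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in TINY_STOPWORDS)

def remove_whitespace(text):
    return " ".join(text.split())

def clean(text, pipeline):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda current, step: step(current), pipeline, text)

toy_pipeline = [lowercase, remove_punctuation, remove_stopwords, remove_whitespace]
print(clean("The rocket is going to Mars!", toy_pipeline))  # rocket going mars
```

Order matters: lowercasing has to run before stopword removal here, because the toy stopword set only contains lowercase words.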

In [None]:
df

# 4. EDA and basic Visualization with Sweetviz

**Question 3:** Why are we using [Sweetviz](https://github.com/fbdesignpro/sweetviz) instead of matplotlib or plotly or bokeh for Exploratory Data Analysis?

**Ans**: Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. Output is a fully self-contained HTML application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

[See an example report generated by Sweetviz from the Titanic dataset HERE](http://cooltiming.com/SWEETVIZ_REPORT.html)


In [None]:
# Check the above link for other installation instructions

!pip3 install sweetviz

In [None]:
# importing sweetviz
import sweetviz as sv

In [None]:
# creating another dataframe df1 for further analysis.

df1 = df.drop(columns=['date_time'])

In [None]:
#to analyze the data and create a report, simply call analyze() method passing in the dataframe as argument

elonmusk_report = sv.analyze(df1)

In [None]:
# save the report as an HTML file and open it in the browser

elonmusk_report.show_html('elonmusk.html')

A lot of information can be analyzed and understood from just one HTML Report before we do any further analysis.

Example Screenshots:

![Correlation Matrix](https://drive.google.com/uc?export=view&id=1-3_ZqGCJ6N_jrzZlt5aXxD5AGa_fyhGJ)

---

![Text Preview](https://drive.google.com/uc?export=view&id=1--gKvrfJ1jXPMn70VPZCk1BByLOUnN4S)



# 5. Convert timezone UTC to IST using pytz

This step is optional. I include it to show how you can convert UTC timestamps to your local timezone, which is useful if you're doing time series analysis.

In [None]:
!pip3 install pytz

In [None]:
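# optional: pandas accepts timezone names as strings, so these imports are
# only needed if you work with datetime objects directly
from datetime import datetime
from pytz import timezone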

In [None]:
# In place of 'UTC', replace it with whatever the current timezone is in your df.
# In place of 'Asia/Kolkata', replace it with whatever timezone you want to convert into.

df['conv_datetime'] = df['date_time'].dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')

In [None]:
# I don't need the "+05:30" UTC offset in my df, so drop the timezone info and keep the local time.

df['datetime'] = df['conv_datetime'].dt.tz_localize(None)
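
The same three steps (mark the timestamp as UTC, convert it, drop the offset) can be traced on a single value with only the standard library. This is a sketch under one assumption: IST is treated as a fixed UTC+05:30 offset, which holds because India does not observe daylight saving time. For timezones with DST, use a named zone as in the pandas cells above. `utc_to_ist_naive` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: IST is a fixed UTC+05:30 offset with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

def utc_to_ist_naive(naive_utc):
    """Mirror tz_localize('UTC') -> tz_convert -> tz_localize(None) for one timestamp."""
    aware_utc = naive_utc.replace(tzinfo=timezone.utc)  # attach UTC, clock time unchanged
    aware_ist = aware_utc.astimezone(IST)               # shift the clock time to IST
    return aware_ist.replace(tzinfo=None)               # drop the offset, keep local time

print(utc_to_ist_naive(datetime(2020, 5, 1, 20, 0, 0)))  # 2020-05-02 01:30:00
```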

In [None]:
# dropping the extra columns and setting the datetime as index.

df = df.drop(columns=['date_time','conv_datetime'])

In [None]:
df = df.set_index('datetime')

# 6. Visualizations using TextHero for further insights

TextHero also comes with some pretty cool visualizations you can explore.

In [None]:
df

In [None]:
df1 = df.drop(columns=['tweet','username','link'])

In [None]:
import matplotlib.pyplot as plt

# using top_words() method, get the top N words and make a bar plot.
hero.top_words(df1['clean_tweet']).head(10).plot.bar(figsize=(15,10))
plt.show()
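
If you want an intuition for what a "top words" count involves, the sketch below does a simplified version with `collections.Counter`: split every text on whitespace and count the tokens. It is not TextHero's code. `count_top_words` and the toy tweets are made up for this example.

```python
from collections import Counter

def count_top_words(texts, n=3):
    """Count whitespace-separated tokens across all texts and return the n most common."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)

toy_tweets = ["tesla model 3", "tesla rocket launch", "rocket launch tesla"]
print(count_top_words(toy_tweets))
```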

In [None]:
# Want to add more stop words to your list? No problem. Follow the steps below.

from texthero import stopwords
default_stopwords = stopwords.DEFAULT
# extra stopwords to add to the default set
# note: a multi-word entry such as "come soon" may not match if stopwords are compared word by word
stop_w = ["twitter","pic","com","yes","like","year","need","ok","exact","come soon","yeah",
          "yup","would","much","use"]
custom_stopwords = default_stopwords.union(set(stop_w))
# call remove_stopwords and pass the custom_stopwords set
df1['clean_tweet'] = hero.remove_stopwords(df1['clean_tweet'], custom_stopwords)

In [None]:
# Let's visualize again.

hero.top_words(df1['clean_tweet']).head(10).plot.bar(figsize=(15,10))
plt.show()

In [None]:
# just checking for any null values
df1.clean_tweet.isna().sum()

In [None]:
# WordCloud with single line of code.

hero.visualization.wordcloud(df1['clean_tweet'],width = 400, height= 400,background_color='White')

In [None]:
#Add pca value to dataframe to use as visualization coordinates
df1['pca'] = (
            df1['clean_tweet']
            .pipe(hero.tfidf)
            .pipe(hero.pca)
   )
#Add k-means cluster to dataframe 
df1['kmeans'] = (
            df1['clean_tweet']
            .pipe(hero.tfidf)
            .pipe(hero.kmeans, n_clusters=5)
   )
df1.head()

In [None]:
# Generate scatter plot for pca and kmeans. Cool isn't it?
hero.scatterplot(df1, 'pca', color = 'kmeans', hover_data=['clean_tweet'] )

# 7. Other Visualizations for further analysis

In [None]:
!pip3 install chart-studio

In [None]:
import seaborn as sns # visualization library
import chart_studio.plotly as py # visualization library
from plotly.offline import init_notebook_mode, iplot # plotly offline mode
init_notebook_mode(connected=True) 
import plotly.graph_objs as go # plotly graphical object

In [None]:
df2 = df.drop(columns=['username','tweet','link'])

In [None]:
df2.head()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['retweets_count'], dashes=False)
plt.title("Retweets over time")
plt.show()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['replies_count'], dashes=False)
plt.title("Replies over time")
plt.show()

In [None]:
plt.figure(figsize=(17,10))
sns.lineplot(data=df2['likes_count'], dashes=False)
plt.title("Likes over time")
plt.show()
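
The plots above draw one point per tweet, which can look noisy over several years of data. One option is to aggregate the counts per day before plotting. The sketch below shows that aggregation on a small made-up frame that stands in for `df2`, assuming a datetime index like the one set in section 5.

```python
import pandas as pd

# Hypothetical stand-in for df2: a datetime index with one engagement column.
toy = pd.DataFrame(
    {"retweets_count": [10, 30, 5, 15]},
    index=pd.to_datetime(
        ["2020-05-01 09:00", "2020-05-01 18:00", "2020-05-02 12:00", "2020-05-04 08:00"]
    ),
)

# Sum retweets per calendar day; days without tweets get 0.
daily = toy["retweets_count"].resample("D").sum()
print(daily)
```

The resulting series can be passed to `sns.lineplot(data=daily)` in the same way as the raw column.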

# What else?

You can check out other libraries like [Hugging Face](https://github.com/huggingface) for NLP, [pendulum](https://github.com/sdispater/pendulum) if you're dealing with dates and times, and [Vaex](https://github.com/vaexio/vaex) if you're dealing with large datasets.

# Next Steps

- Build a topic model and check whether you can group @elonmusk's tweets into different categories.
- Run sentiment analysis on his tweets.
- Track how the sentiment changes over time.
- Take recent stock data for TSLA and check whether his tweets influence TSLA or other stocks.

Please give a star for this repository if it helped you and raise issues if you find any. Thank you!
