<h1><b>Twitter tweet analysis</b></h1>

<p1><b>In this note we investigate a set of tweets that use the hashtag "COVID19" and plot the locations of the tweeters on a map.</b></p1>

In [None]:
import pandas as pd
import numpy as np
import folium
from geopy.geocoders import Nominatim
import json
import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas DataFrame (top-level import since pandas 1.0)
import matplotlib.pyplot as plt
from geopy.extra.rate_limiter import RateLimiter

In [None]:
tweets_df=pd.read_csv('../input/covid19-tweets/covid19_tweets.csv',header=0)

In [None]:
tweets_df.head()

<h2><b>Let's see if there are any null entries in the DataFrame</b></h2>

In [None]:
tweets_df.info()

<h2><b>Summary statistics</b></h2>

In [None]:
tweets_df.describe()

**Let's replace any empty cells with np.nan**

In [None]:
tweets_df = tweets_df.replace('', np.nan)  # replace() returns a new DataFrame, so assign it back
tweets_df.info()
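
One detail worth keeping in mind with `DataFrame.replace`: it returns a new DataFrame instead of changing the existing one, so the result has to be assigned back to take effect. Below is a minimal sketch on a made-up toy frame (`demo_df` and its values are illustrative and not part of the data set).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for tweets_df
demo_df = pd.DataFrame({"user_location": ["London", "", "Chennai"]})

# Calling replace() without assigning the result leaves demo_df unchanged
demo_df.replace("", np.nan)
blanks_before = int(demo_df["user_location"].isna().sum())

# Assigning the result back makes the empty string a real missing value
demo_df = demo_df.replace("", np.nan)
blanks_after = int(demo_df["user_location"].isna().sum())

print(blanks_before, blanks_after)
```

Only after the reassignment does `isna()` (and therefore `dropna()`) see the empty cell as missing.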

**Let's drop all rows that contain NaN entries**

In [None]:
tweets_df.dropna(inplace=True)
tweets_df.info()

**Number of tweets per user**

In [None]:
tweets_df['user_name'].value_counts()

> **Let's plot the top 10 users based on tweet count**

In [None]:
top_users=tweets_df.groupby('user_name')['user_location'].count().reset_index()
top_users.columns=['user_name','count']
top_users.sort_values('count',ascending=False,inplace=True)
top_users[0:10].plot(kind='bar',x='user_name',y='count')
plt.xlabel('Users')
plt.ylabel('Tweets')
plt.title('Top 10 tweeters')
plt.show()

<h2><b>Removing all non-alphanumeric characters from the user_location column (this is helpful when plotting user locations on a map)</b></h2>

In [None]:
locations=tweets_df['user_location'].replace('[^a-zA-Z0-9 ]', '', regex=True)
tweets_df['user_location']=locations
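
To see what this pattern keeps and what it drops, here is a small sketch that applies the same regex to a few made-up location strings (the sample values are hypothetical). Note that the character class only allows ASCII letters, digits and spaces, so accented letters are removed as well.

```python
import pandas as pd

# Hypothetical sample locations; \u00e3 is the letter "a" with a tilde
sample = pd.Series([
    "New Delhi, India",
    "S\u00e3o Paulo",
    "Washington, D.C.",
    "#Earth",
])

# Same pattern as in the cell above: drop everything except a-z, A-Z, 0-9 and space
cleaned = sample.replace("[^a-zA-Z0-9 ]", "", regex=True)

print(cleaned.tolist())
```

Punctuation disappears as intended, but "S\u00e3o Paulo" loses its accented letter too, which may matter when the cleaned strings are later passed to a geocoder.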

In [None]:
tweets_df.reset_index(inplace=True,drop=True)
tweets_df.head()

In [None]:
tweets_df.dropna(inplace=True)
tweets_df.info()

<p1><b>Getting the latitude and longitude of every location in this data set is a time-consuming process, so that step is not run here. If we had those coordinates, we could use Folium to plot the tweet density distribution on a world map and make the analysis more interactive.</b></p1>
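
One way to cut that cost, sketched here under assumptions, is to look up each distinct location string only once and reuse the result for every tweet that shares it. The helper `geocode_unique` and the stand-in `fake_lookup` are hypothetical names introduced for illustration; in a real run the lookup could be a rate-limited geopy call wrapped so that it returns a `(latitude, longitude)` tuple or `None`.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def geocode_unique(
    locations: Iterable[str],
    lookup: Callable[[str], Coordinates],
) -> Dict[str, Coordinates]:
    """Call `lookup` once per distinct location string and cache the result."""
    cache: Dict[str, Coordinates] = {}
    for place in locations:
        if place not in cache:
            cache[place] = lookup(place)
    return cache


# Hypothetical stand-in for a real geocoder, so the sketch runs offline
calls = []


def fake_lookup(place: str) -> Coordinates:
    calls.append(place)
    known = {"London": (51.5, -0.13), "Chennai": (13.08, 80.27)}
    return known.get(place)  # None when the place is unknown


coords = geocode_unique(["London", "Chennai", "London", "nowhere"], fake_lookup)
print(len(calls), coords)
```

With four input rows but three distinct strings, the lookup is called three times; on a data set where many users share the same location text, the saving would be correspondingly larger.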

In [None]:
tweets_df['date']=pd.to_datetime(tweets_df['date'])

In [None]:
top_july = tweets_df['user_location'][pd.DatetimeIndex(tweets_df['date']).month == 7].value_counts()

In [None]:
top_august = tweets_df['user_location'][pd.DatetimeIndex(tweets_df['date']).month == 8].value_counts()
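# fill_value=0 keeps locations that appear in only one of the two months (plain + would give NaN)
top_all_the_time = top_august.add(top_july, fill_value=0).sort_values(ascending = False)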
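
When two `value_counts()` results are combined, pandas aligns them on the index, and with a plain `+` any location present in only one of them comes out as NaN. `Series.add` with `fill_value=0` treats the missing side as zero. A small sketch with made-up counts (the Series below are hypothetical, not taken from the data set):

```python
import pandas as pd

# Hypothetical per-month counts
july = pd.Series({"London": 3, "Delhi": 2})
august = pd.Series({"London": 1, "Lagos": 4})

# Plain addition: locations missing from either month become NaN
plain_sum = august + july

# add() with fill_value=0: a missing month counts as zero tweets
filled_sum = august.add(july, fill_value=0)

print(plain_sum)
print(filled_sum)
```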

<h2><b>Top places based on tweets in July</b></h2>

In [None]:
fig, ax = plt.subplots(figsize = (13,5))
# user_location is free text, so these are locations rather than strictly countries
top_july[0:10].plot(kind='bar', ax=ax, title = "Top 10 Locations Posting about Covid-19 in July")
ax.set_xlabel("Location", fontsize = 12)
ax.set_ylabel("Number of tweets", fontsize = 12)

<h2><b>Top places based on tweets in August</b></h2>

In [None]:
fig, ax=plt.subplots(figsize=(13,5))
# user_location is free text, so these are locations rather than strictly countries
top_august[0:10].plot(kind='bar', ax=ax, title="Top 10 Locations Posting about Covid-19 in August")
ax.set_xlabel("Top Locations")
ax.set_ylabel("Tweet Count")

In [None]:
a=tweets_df['source'].value_counts()

<h2><b>Top sources of tweets</b></h2>

In [None]:
a[0:5].plot(kind='bar')

In [None]:
tags=tweets_df['hashtags'].value_counts()

<h2><b>Top hashtags used</b></h2>

In [None]:
tags[0:10].plot(kind='bar')
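
`value_counts()` on the raw column counts whole cell values, so a tweet with two hashtags is counted as one combined entry. If individual hashtags are of interest, a possible sketch is shown below. It assumes each cell holds the text form of a Python list (for example `"['COVID19', 'Lockdown']"`); that format is an assumption about the file, and the sample values are made up.

```python
import ast

import pandas as pd

# Hypothetical sample cells, assuming list-like strings in the hashtags column
raw = pd.Series([
    "['COVID19']",
    "['COVID19', 'Lockdown']",
    "['Lockdown', 'StaySafe']",
])

# Parse each string into a real list, then give every hashtag its own row
per_tag = raw.apply(ast.literal_eval).explode().value_counts()

print(per_tag)
```

Here `COVID19` and `Lockdown` are each counted twice, even though no two cells are identical.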

In [None]:
user_status=tweets_df['user_verified'].value_counts()

In [None]:
user_status.plot(kind='bar')
plt.xlabel("Account verification status")
plt.ylabel("Number of tweets")
plt.title("Verified account tweets vs Unverified account tweets")
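# Label each bar from the index so the labels stay correct whichever status is more common
plt.xticks(range(len(user_status)), ['Verified' if v else 'Unverified' for v in user_status.index], rotation=0)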

<h2><b>Let's plot the location of the top 100 tweeters on a map</b></h2>

Since the same user may have posted multiple tweets, the DataFrame contains many duplicate entries per user. Let's create a new DataFrame with a single row for each user.

In [None]:
tweets_df2=tweets_df.drop_duplicates(subset='user_name')
tweets_df2=tweets_df2[['user_name','user_location']]
tweets_df2.set_index('user_name',inplace=True)
tweets_df2.head()

In [None]:
to_plot_list=top_users['user_name']
to_plot_list=to_plot_list[0:100]
#to_plot_list.shape

In [None]:
to_plot_map=tweets_df2.loc[to_plot_list]
#to_plot_map.columns

In [None]:
#to_plot_map.shape

In [None]:
geolocator = Nominatim(user_agent="ny_explorer")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
longitude=[]
latitude=[]

#address=['chennai','dallas','hsjhgdgh','London, UK']

for i in to_plot_map['user_location'].astype(str):
    location = geocode(i)  # use the rate-limited wrapper defined above
    if location:
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    else:
        # keep both lists the same length when a lookup fails
        latitude.append(np.nan)
        longitude.append(np.nan)
    
    
print(len(latitude)) 


In [None]:
world_map = folium.Map(location=[20, 0], zoom_start=2)  # low zoom level so the whole world is visible


In [None]:
latitude=list(latitude)
longitude=list(longitude)

cleanedlatitude = [x for x in latitude if str(x) != 'nan']


In [None]:
cleanedlongitude = [x for x in longitude if str(x) != 'nan']
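
Filtering the two lists separately keeps them aligned only as long as every failed lookup produced NaN in both. A slightly more defensive sketch filters the coordinates as pairs, so a latitude can never end up matched with the wrong longitude. The lists `demo_lat` and `demo_lon` below are made-up values for illustration.

```python
import math

# Hypothetical geocoding output; NaN marks a failed lookup
demo_lat = [51.5, float("nan"), 13.08]
demo_lon = [-0.13, float("nan"), 80.27]

# Keep a point only when both of its coordinates are real numbers
points = [
    (lat, lon)
    for lat, lon in zip(demo_lat, demo_lon)
    if not (math.isnan(lat) or math.isnan(lon))
]

print(points)
```

Each `(lat, lon)` tuple can then be passed directly as the location of a marker.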

In [None]:
incident_tweets=folium.map.FeatureGroup()

for i, j in zip(cleanedlatitude, cleanedlongitude):
    incident_tweets.add_child(
        folium.CircleMarker(
            [i, j],
            radius=5, # define how big you want the circle markers to be
            color='red',
            fill=True,
            fill_color='green',
            fill_opacity=0.5,
            
        
        )
    )
    
world_map.add_child(incident_tweets)

<h1><b>End of my basic analysis. Feel free to share your comments!!!!</b></h1>