# Cleaning raw JSON tweet data scraped with the snscrape library (base code taken from the assessment brief)

## 1. Importing required libraries

Let's start by importing the required libraries. We need pandas to load and manipulate the JSON data, plus its json_normalize() function to flatten nested JSON fields into columns. In pandas 1.0 and later this function is available at the top level as pandas.json_normalize(); the older pandas.io.json import path is deprecated.

In [1]:
# Importing required libraries

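import pandas as pd
# json_normalize is exposed at the top level since pandas 1.0;
# the older pandas.io.json import path is deprecated
from pandas import json_normalize
import warnings
# Silence all warnings (note that this also hides pandas warnings)
warnings.filterwarnings("ignore")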

## 2. Read raw JSON tweets data

Next, we load the raw JSON tweets with the read_json() function from pandas. The argument lines=True makes pandas parse the file as one JSON object per line. Since the later analysis relies on techniques such as NLP, I have retained only tweets in English. Let's take a look at the first 5 records of the raw data.

In [2]:
# Read JSON file containing tweets data and remove tweets not in English

raw_tweets = pd.read_json(r'farmers-protest-tweets-2021-03-5.json', lines=True)
raw_tweets = raw_tweets[raw_tweets['lang']=='en']
print("Shape: ", raw_tweets.shape)
raw_tweets.head(5)

Shape:  (417511, 21)


Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30 03:33:46+00:00,Support 👇\n\n#FarmersProtest,Support 👇\n\n#FarmersProtest,1376739399593910273,"{'username': 'ShashiRajbhar6', 'displayname': ...",[],[],0,0,...,0,1376739399593910273,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:33:23+00:00,Supporting farmers means supporting our countr...,Supporting farmers means supporting our countr...,1376739306287427584,"{'username': 'kaursuk06272818', 'displayname':...",[],[],0,0,...,0,1376739306287427584,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:31:00+00:00,Support farmers if you are related to food #St...,Support farmers if you are related to food #St...,1376738704128020488,"{'username': 'kaursuk06272818', 'displayname':...",[],[],0,0,...,0,1376738704128020488,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30 03:30:45+00:00,#StopHateAgainstFarmers support #FarmersProtes...,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,"{'username': 'SukhdevSingh_', 'displayname': '...",[],[],0,1,...,0,1376738640542400518,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30 03:30:30+00:00,"You hate farmers I hate you, \nif you love the...","You hate farmers I hate you, \nif you love the...",1376738579171344386,"{'username': 'Davidmu66668113', 'displayname':...",[],[],0,0,...,0,1376738579171344386,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,
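
To make the loading step easier to follow, here is a minimal sketch on a tiny in-memory sample. The three records and the names `sample`, `demo` and `demo_en` are made up for illustration; only `pd.read_json(..., lines=True)` and the boolean filter on 'lang' mirror the cell above.

```python
import io

import pandas as pd

# Three hypothetical tweets in JSON Lines format: one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "lang": "en", "content": "Support #FarmersProtest"}\n'
    '{"id": 2, "lang": "hi", "content": "placeholder text"}\n'
    '{"id": 3, "lang": "en", "content": "Farmers feed the world"}\n'
)

# lines=True tells read_json to parse each line as a separate record.
demo = pd.read_json(sample, lines=True)

# The boolean mask keeps only the rows whose 'lang' value is 'en'.
demo_en = demo[demo["lang"] == "en"]
print(demo_en.shape)
```

Note that the filtered DataFrame keeps the original row labels (0 and 2 here), which is why the index of a filtered result can have gaps.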


## 3. Normalize 'user' field in raw_tweets

We see that 'raw_tweets' has a nested JSON field named 'user': each value is a dictionary of profile attributes. The json_normalize() function flattens this semi-structured data into a table with one column per attribute.

For more info on how to use json_normalize(), check out [the documentation page for pandas.io.json.json_normalize()](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.io.json.json_normalize.html). Note that this link points to the pandas 0.17.0 documentation.

I have also renamed the fields 'id' to 'userId' and 'url' to 'profileUrl' so that they are not confused with the tweet-level 'id' and 'url' fields. The fields 'description' and 'linkTcourl' are not needed for this analysis and have been dropped.

Let's take a look at the first 5 records.

In [3]:
# Normalize 'user' field

users = json_normalize(raw_tweets['user'])
users.drop(['description', 'linkTcourl'], axis=1, inplace=True)
users.rename(columns={'id':'userId', 'url':'profileUrl'}, inplace=True)
users.head(5)

Unnamed: 0,username,displayname,userId,rawDescription,descriptionUrls,verified,created,followersCount,friendsCount,statusesCount,favouritesCount,listedCount,mediaCount,location,protected,linkUrl,profileImageUrl,profileBannerUrl,profileUrl
0,ShashiRajbhar6,Shashi Rajbhar,1015969769760096256,Satya presan 🤔ho Sakta but prajit💪 nhi\njhuth ...,[],False,2018-07-08T14:44:03+00:00,1788,1576,14396,26071,1,254,"Azm Uttar Pradesh, India",False,,https://pbs.twimg.com/profile_images/135433129...,https://pbs.twimg.com/profile_banners/10159697...,https://twitter.com/ShashiRajbhar6
1,kaursuk06272818,KAUR SUKH🌾ਕੌਰ ਸੁਖ,1332937272581263362,ਜਿਓਣਾ ਕੀ ਸਰੀਰਾਂ ਦਾ ਜੇਕਰ ਹੋਣ ਜ਼ਮੀਰਾਂ ਮਰੀਆਂ 🌼,[],False,2020-11-29T06:40:06+00:00,51,68,1338,3676,0,607,,False,,https://pbs.twimg.com/profile_images/133295149...,https://pbs.twimg.com/profile_banners/13329372...,https://twitter.com/kaursuk06272818
2,kaursuk06272818,KAUR SUKH🌾ਕੌਰ ਸੁਖ,1332937272581263362,ਜਿਓਣਾ ਕੀ ਸਰੀਰਾਂ ਦਾ ਜੇਕਰ ਹੋਣ ਜ਼ਮੀਰਾਂ ਮਰੀਆਂ 🌼,[],False,2020-11-29T06:40:06+00:00,51,68,1338,3676,0,607,,False,,https://pbs.twimg.com/profile_images/133295149...,https://pbs.twimg.com/profile_banners/13329372...,https://twitter.com/kaursuk06272818
3,SukhdevSingh_,Sukhdev Singh,1308356658582618112,Just a part of my society . Social and Politic...,[],False,2020-09-22T10:45:27+00:00,2595,3314,3281,3533,0,519,"Punjab, India",False,,https://pbs.twimg.com/profile_images/130835702...,https://pbs.twimg.com/profile_banners/13083566...,https://twitter.com/SukhdevSingh_
4,Davidmu66668113,tera jija 🤨🚩🇺🇸,1357311756532649985,dream boy 🌪🌍🔥💯,[],False,2021-02-04T12:55:36+00:00,18,286,347,520,0,3,,False,,https://pbs.twimg.com/profile_images/137600703...,https://pbs.twimg.com/profile_banners/13573117...,https://twitter.com/Davidmu66668113
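
As a small illustration of what flattening means, the sketch below runs `pd.json_normalize()` on two made-up records. The field names `stats`, `followers` and `friends` are hypothetical and are not part of the tweet data; they are there only to show how a nested dictionary becomes dotted column names.

```python
import pandas as pd

# Hypothetical records with one nested dictionary each.
records = [
    {"id": 10, "username": "alice", "stats": {"followers": 5, "friends": 7}},
    {"id": 11, "username": "bob", "stats": {"followers": 2, "friends": 1}},
]

# Each key becomes a column; nested keys are joined with a dot.
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# Drop a column that is not needed and give 'id' a more explicit name.
flat = flat.drop(columns=["stats.friends"]).rename(columns={"id": "userId"})
print(flat)
```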


## 4. Create users DF

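Next, let's create the final DataFrame for Twitter users who tweeted using the hashtag "#FarmersProtest". The normalized data contains one row per tweet, so a user who tweeted several times appears several times (rows 1 and 2 above, for example). I have therefore dropped duplicate records based on the field 'userId', since each user has a unique user ID.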

Let's take a look at the shape and first 5 records for the final DataFrame for the Twitter users.

In [4]:
# Create DataFrame and remove duplicates

users = pd.DataFrame(users)
users.drop_duplicates(subset=['userId'], inplace=True)
print("Shape: ", users.shape)
users.head(5)

Shape:  (93223, 19)


Unnamed: 0,username,displayname,userId,rawDescription,descriptionUrls,verified,created,followersCount,friendsCount,statusesCount,favouritesCount,listedCount,mediaCount,location,protected,linkUrl,profileImageUrl,profileBannerUrl,profileUrl
0,ShashiRajbhar6,Shashi Rajbhar,1015969769760096256,Satya presan 🤔ho Sakta but prajit💪 nhi\njhuth ...,[],False,2018-07-08T14:44:03+00:00,1788,1576,14396,26071,1,254,"Azm Uttar Pradesh, India",False,,https://pbs.twimg.com/profile_images/135433129...,https://pbs.twimg.com/profile_banners/10159697...,https://twitter.com/ShashiRajbhar6
1,kaursuk06272818,KAUR SUKH🌾ਕੌਰ ਸੁਖ,1332937272581263362,ਜਿਓਣਾ ਕੀ ਸਰੀਰਾਂ ਦਾ ਜੇਕਰ ਹੋਣ ਜ਼ਮੀਰਾਂ ਮਰੀਆਂ 🌼,[],False,2020-11-29T06:40:06+00:00,51,68,1338,3676,0,607,,False,,https://pbs.twimg.com/profile_images/133295149...,https://pbs.twimg.com/profile_banners/13329372...,https://twitter.com/kaursuk06272818
3,SukhdevSingh_,Sukhdev Singh,1308356658582618112,Just a part of my society . Social and Politic...,[],False,2020-09-22T10:45:27+00:00,2595,3314,3281,3533,0,519,"Punjab, India",False,,https://pbs.twimg.com/profile_images/130835702...,https://pbs.twimg.com/profile_banners/13083566...,https://twitter.com/SukhdevSingh_
4,Davidmu66668113,tera jija 🤨🚩🇺🇸,1357311756532649985,dream boy 🌪🌍🔥💯,[],False,2021-02-04T12:55:36+00:00,18,286,347,520,0,3,,False,,https://pbs.twimg.com/profile_images/137600703...,https://pbs.twimg.com/profile_banners/13573117...,https://twitter.com/Davidmu66668113
5,Abhimanyu_1987,Abhimanyu 🌏 🇮🇳,2918610912,Seeker...,[],False,2014-12-04T13:29:54+00:00,173,41,8954,16364,19,112,"Jaipur,Rajasthan,India",False,,https://pbs.twimg.com/profile_images/125684524...,https://pbs.twimg.com/profile_banners/29186109...,https://twitter.com/Abhimanyu_1987
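
The effect of `drop_duplicates(subset=...)` can be seen on a toy example. The data and the names `demo_users` and `unique_users` below are invented for illustration.

```python
import pandas as pd

# Hypothetical users table in which user 2 appears twice.
demo_users = pd.DataFrame(
    {"userId": [1, 2, 2, 3], "username": ["a", "b", "b", "c"]}
)

# By default the first occurrence of each 'userId' is kept.
unique_users = demo_users.drop_duplicates(subset=["userId"])
print(unique_users.index.tolist())
```

The surviving rows keep their original labels, which matches the output above, where the index jumps from 1 to 3 once the repeated user is removed.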


## 5. Create tweets DF

Next, we transform the 'raw_tweets' DataFrame into a DataFrame of tweets that contain the hashtag "#FarmersProtest". A new field, 'userId', is added; it holds the unique ID of the user who posted each tweet.

I have then retained only the important fields and renamed 'id' to 'tweetId' and 'url' to 'tweetUrl' to avoid confusion with the user-level fields.

Let's take a look at the first 5 records of this DataFrame.

In [5]:
# Transform 'raw_tweets' DataFrame

# Add column for 'userId', taken from the nested 'user' dictionary
raw_tweets['userId'] = raw_tweets['user'].apply(lambda user: user['id'])

# Keep only the columns needed for the analysis
cols = ['url', 'date', 'renderedContent', 'id', 'userId', 'replyCount', 'retweetCount', 'likeCount', 'quoteCount', 'source', 'media', 'retweetedTweet', 'quotedTweet', 'mentionedUsers']
# Rename on a new DataFrame instead of in place on a slice of 'raw_tweets'
tweets = raw_tweets[cols].rename(columns={'id':'tweetId', 'url':'tweetUrl'})
tweets.head(5)

Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30 03:33:46+00:00,Support 👇\n\n#FarmersProtest,1376739399593910273,1015969769760096256,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:33:23+00:00,Supporting farmers means supporting our countr...,1376739306287427584,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:31:00+00:00,Support farmers if you are related to food #St...,1376738704128020488,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30 03:30:45+00:00,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,1308356658582618112,0,1,3,0,"<a href=""http://twitter.com/download/android"" ...",,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30 03:30:30+00:00,"You hate farmers I hate you, \nif you love the...",1376738579171344386,1357311756532649985,0,0,1,0,"<a href=""http://twitter.com/download/android"" ...",,,,
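
One practical use of carrying 'userId' in the tweets table is that tweets can be joined back to the users table when needed. The sketch below is only an illustration of that idea on made-up data; the join itself is not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
demo_users = pd.DataFrame({"userId": [1, 2], "username": ["alice", "bob"]})
demo_tweets = pd.DataFrame({"tweetId": [100, 101, 102], "userId": [2, 1, 2]})

# A left join keeps every tweet and attaches the matching user columns.
joined = demo_tweets.merge(demo_users, on="userId", how="left")
print(joined["username"].tolist())
```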


Finally, I create the final DataFrame for tweets that contain the hashtag "#FarmersProtest". Duplicate records are dropped based on 'tweetId', the unique ID of each tweet.

Let's take a look at the shape and first 5 records of the final tweets DataFrame.

In [6]:
# Convert to DataFrame and remove duplicates (non-English tweets were already removed when loading)

tweets = pd.DataFrame(tweets)
tweets.drop_duplicates(subset=['tweetId'], inplace=True)
print("Shape: ", tweets.shape)
tweets.head(5)

Shape:  (417511, 14)


Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30 03:33:46+00:00,Support 👇\n\n#FarmersProtest,1376739399593910273,1015969769760096256,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:33:23+00:00,Supporting farmers means supporting our countr...,1376739306287427584,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:31:00+00:00,Support farmers if you are related to food #St...,1376738704128020488,1332937272581263362,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30 03:30:45+00:00,#StopHateAgainstFarmers support #FarmersProtes...,1376738640542400518,1308356658582618112,0,1,3,0,"<a href=""http://twitter.com/download/android"" ...",,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30 03:30:30+00:00,"You hate farmers I hate you, \nif you love the...",1376738579171344386,1357311756532649985,0,0,1,0,"<a href=""http://twitter.com/download/android"" ...",,,,


# 4 Python functions

In [8]:
# Load the 4 Python functions
%run "f1.ipynb" #top 10 tweets + retweets
#%run "f2.ipynb" #top 10 users
#%run "f3.ipynb" #top 10 days
#%run "f4.ipynb" #top 10 hashtags

### Function 1: Top 10 most retweeted tweets

In [9]:
# Function 1
f1_top10(tweets)

Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
408128,https://twitter.com/rihanna/status/13566258896...,2021-02-02 15:29:51+00:00,why aren’t we talking about this?! #FarmersPro...,1356625889602199552,79293791,163065,315547,944307,45832,"<a href=""http://twitter.com/download/iphone"" r...",,,,
395142,https://twitter.com/GretaThunberg/status/13566...,2021-02-02 20:04:01+00:00,We stand in solidarity with the #FarmersProtes...,1356694884615340037,1006419421244678144,49793,103957,319363,13815,"<a href=""http://twitter.com/download/iphone"" r...",,,,
266196,https://twitter.com/GretaThunberg/status/13572...,2021-02-04 10:59:01+00:00,I still #StandWithFarmers and support their pe...,1357282507616645122,1006419421244678144,39596,67694,234676,10587,"<a href=""http://twitter.com/download/iphone"" r...",,,,
366579,https://twitter.com/miakhalifa/status/13568483...,2021-02-03 06:14:01+00:00,"“Paid actors,” huh? Quite the casting director...",1356848397899112448,2835653131,15569,35921,139959,5681,"<a href=""http://twitter.com/download/iphone"" r...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
372793,https://twitter.com/miakhalifa/status/13568277...,2021-02-03 04:51:48+00:00,What in the human rights violations is going o...,1356827705161879553,2835653131,9082,26972,99227,4606,"<a href=""http://twitter.com/download/iphone"" r...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
314192,https://twitter.com/TeamJuJu/status/1357048037...,2021-02-03 19:27:19+00:00,"Happy to share that I’ve donated $10,000 to pr...",1357048037302960129,733170759829327874,7683,23251,59248,4082,"<a href=""http://twitter.com/download/iphone"" r...",,,,
215034,https://twitter.com/BobBlackman/status/1357755...,2021-02-05 18:19:19+00:00,There has been much social media coverage arou...,1357755699162398720,805185025,1845,20132,42779,1592,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
398011,https://twitter.com/vanessa_vash/status/135668...,2021-02-02 19:09:23+00:00,Farmers feed the world. Fight for them. Protec...,1356681136655769605,1134059457191776257,1301,18744,67986,820,"<a href=""http://twitter.com/download/android"" ...",,,,
325261,https://twitter.com/kylekuzma/status/135700972...,2021-02-03 16:55:04+00:00,Should be talking about this! #FarmersProtest\...,1357009721090138112,272616327,4167,17368,39653,2505,"<a href=""http://twitter.com/download/iphone"" r...",,,,
163689,https://twitter.com/AmandaCerny/status/1359013...,2021-02-09 05:36:49+00:00,To all of my influencer/celeb friends- read up...,1359013362881994752,104856942,2028,15677,81375,813,"<a href=""http://twitter.com/download/iphone"" r...",,,,
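
The implementation of f1_top10() lives in f1.ipynb and is not shown here. As a rough sketch of one way such a function could work, the hypothetical helper `top_retweeted()` below sorts by the 'retweetCount' column with `DataFrame.nlargest()`. It is an assumption for illustration, not the notebook's actual code.

```python
import pandas as pd


def top_retweeted(tweets: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n rows with the highest 'retweetCount' (hypothetical helper)."""
    return tweets.nlargest(n, "retweetCount")


# Made-up data to show the behaviour.
demo = pd.DataFrame({"tweetId": [1, 2, 3, 4], "retweetCount": [5, 50, 0, 20]})
top2 = top_retweeted(demo, n=2)
print(top2)
```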


### Function 2: Top 10 users by number of tweets posted

In [None]:
# Function 2
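
The cell above is empty in the source. One possible approach, sketched under the assumption that the tweets DataFrame has the 'userId' column created earlier, is to count rows per user with `value_counts()`. The name `top_users()` is hypothetical.

```python
import pandas as pd


def top_users(tweets: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return tweet counts for the n most active users (hypothetical helper)."""
    # value_counts() sorts by count in descending order.
    return tweets["userId"].value_counts().head(n)


# Made-up data: user 2 tweeted three times, user 1 twice, user 3 once.
demo = pd.DataFrame({"userId": [1, 2, 2, 3, 2, 1]})
most_active = top_users(demo, n=2)
print(most_active)
```

The result is indexed by 'userId'; usernames could be attached afterwards by joining against the users DataFrame.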

### Function 3: Top 10 days with the most tweets

In [None]:
# Function 3
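
The cell above is empty in the source. A minimal sketch, assuming the 'date' column holds timezone-aware timestamps as in the outputs shown earlier, groups tweets by calendar day. The name `top_days()` is hypothetical.

```python
import pandas as pd


def top_days(tweets: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return tweet counts for the n busiest calendar days (hypothetical helper)."""
    # .dt.date drops the time of day, so tweets from the same day share a key.
    return tweets["date"].dt.date.value_counts().head(n)


# Made-up timestamps: two tweets on 2021-02-02 and one on 2021-02-03.
demo = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2021-02-02 15:29:51+00:00",
                "2021-02-02 20:04:01+00:00",
                "2021-02-03 06:14:01+00:00",
            ],
            utc=True,
        )
    }
)
busiest = top_days(demo, n=1)
print(busiest)
```

Days are taken in the timezone of the timestamps, which is UTC in this sketch.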

### Function 4: Top 10 most used hashtags

In [None]:
# Function 4
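
The cell above is empty in the source. One way to approach it, sketched here as an assumption, is to extract hashtags from 'renderedContent' with a regular expression and count them. The name `top_hashtags()`, the pattern `#\w+` and the choice to lowercase tags so that "#FarmersProtest" and "#farmersprotest" count together are all illustrative decisions.

```python
import pandas as pd


def top_hashtags(tweets: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return counts for the n most frequent hashtags (hypothetical helper)."""
    tags = (
        tweets["renderedContent"]
        .str.findall(r"#\w+")  # list of hashtags per tweet
        .explode()             # one row per hashtag
        .dropna()              # tweets without hashtags yield NaN
        .str.lower()           # count case variants together
    )
    return tags.value_counts().head(n)


# Made-up tweets to show the behaviour.
demo = pd.DataFrame(
    {
        "renderedContent": [
            "Support #FarmersProtest",
            "#farmersprotest #StandWithFarmers",
            "no tags here",
        ]
    }
)
popular = top_hashtags(demo)
print(popular)
```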