![](https://c4.wallpaperflare.com/wallpaper/410/494/431/racing-f1-car-formula-1-race-car-hd-wallpaper-preview.jpg)

<div class='alert alert-info'>
   
<p> - Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater formula racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). The World Drivers' Championship, which became the FIA Formula One World Championship in 1981, has been one of the premier forms of racing around the world since its inaugural season in 1950. The word "formula" in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on both purpose-built circuits and closed public roads.</p><br>
<p> - F1 has a passionate fan base that generates a great deal of activity on major social media platforms such as Twitter. This dataset collects tweets posted with the #f1 hashtag.</p>
</div>

<div class='alert alert-info'>
    <h3><center>This notebook analyses tweets posted with the trending #f1 hashtag. So grab your gloves, fasten your seatbelts, and let's explore the impact of F1 on social media platforms like Twitter.</center></h3>
    </div>

![](https://i.redd.it/0awzn68sz1o01.gif)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0' role="tab" aria-controls="home" color=black><center>Quick navigation</center></h3>

* [1. Required Libraries](#1)
* [2. Dataset Quick Overview](#2)
* [3. Tweets EDA](#3)
* [4. Tweets text analysis](#4)   

    Kindly, Upvote the notebook!

<a id="1"></a>
<h2 style='background:gold; border:0; color:black'><center>Required Libraries</center></h2>

In [None]:
import numpy as np 
import pandas as pd 
import os
import itertools

#plots
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.colors import n_colors
from plotly.subplots import make_subplots

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer

from PIL import Image
from nltk.corpus import stopwords
stop=set(stopwords.words('english'))
from nltk.util import ngrams


import re
from collections import Counter

import nltk

import requests
import json

import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})

import warnings
warnings.filterwarnings("ignore")

<a id="2"></a>
<h2 style='background:gold; border:0; color:black'><center>Dataset Quick Overview</center></h2>

## Let's get some basic information about the data!

In [None]:
f1=pd.read_csv('../input/formula-1-trending-tweets/F1_tweets.csv')
f1.info()

In [None]:
f1.shape

## Let's visualize some missing values!

In [None]:
import missingno as mno
mno.matrix(f1)

In [None]:
missed = pd.DataFrame()
missed['column'] = f1.columns

missed['percent'] = [round(100* f1[col].isnull().sum() / len(f1), 2) for col in f1.columns]
missed = missed.sort_values('percent',ascending=False)
missed = missed[missed['percent']>0]

fig = sns.barplot(
    x=missed['percent'], 
    y=missed["column"], 
    orient='h'
).set_title('Percentage of missing values per column')
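
The per-column loop above can also be expressed with `isnull().mean()`, which returns the fraction of missing entries for every column in one call. A minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the tweets DataFrame
toy = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_location": ["London", None, None, "Paris"],
    "hashtags": ["['F1']", None, "['F1']", "['F1']"],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

The result is a Series indexed by column name, so it can be passed to a bar plot without building an intermediate DataFrame.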

<a id="3"></a>
<h2 style='background:gold; border:0; color:black'><center>Tweets EDA</center></h2>

## Let's visualize the top 20 users by number of tweets


In [None]:
ds = f1['user_name'].value_counts().reset_index()
ds.columns = ['user_name', 'tweets_count']
ds = ds.sort_values(['tweets_count'],ascending=False)
f1 = pd.merge(f1, ds, on='user_name')

fig = sns.barplot( 
    x=ds.head(20)["tweets_count"], 
    y=ds.head(20)["user_name"], 
    orient='h'
).set_title('Top 20 users by number of tweets') 



## User accounts created per year


In [None]:
f1['user_created'] = pd.to_datetime(f1['user_created'],infer_datetime_format=True,errors ='coerce')
f1['year_created'] = f1['user_created'].dt.year
data = f1.drop_duplicates(subset='user_name', keep="first")
data = data[data['year_created']>1970]
data = data['year_created'].value_counts().reset_index()
data.columns = ['year', 'number']

fig = sns.barplot( 
    x=data["year"], 
    y=data["number"], 
    orient='v'
).set_title('User accounts created per year')

## Top 20 user locations by number of tweets

In [None]:
ds = f1['user_location'].value_counts().reset_index()
ds.columns = ['user_location', 'count']
ds = ds[ds['user_location']!='NA']
ds = ds.sort_values(['count'],ascending=False)

fig = sns.barplot(
    x=ds.head(20)["count"], 
    y=ds.head(20)["user_location"], 
    orient='h'
).set_title('Top 20 user locations by number of tweets')

## Visualizing the share of tweets per location

In [None]:
from plotly.offline import init_notebook_mode, iplot
def pie_count(data, field, percent_limit, title):
    # Count tweets per category; missing values are grouped under 'NA'.
    # Working on a filled copy leaves the caller's DataFrame untouched.
    counts = data[field].fillna('NA').value_counts()
    percentage = 100 * counts / counts.sum()

    # Keep categories at or above the threshold and lump the rest together
    main = counts[percentage >= percent_limit]
    others_total = int(counts[percentage < percent_limit].sum())
    other_label = "Others(<" + str(percent_limit) + "% each)"

    labels = main.index.tolist() + [other_label]
    datavals = main.tolist() + [others_total]

    trace=go.Pie(labels=labels,values=datavals)
    
    layout = go.Layout(
        title = title,
        height=600,
        width=600
        )
    
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)
    
pie_count(f1, 'user_location', 0.5, 'Number of tweets per location')
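
The core idea of the pie chart above is to merge every category below a percentage threshold into a single "Others" slice. A minimal standard-library sketch of that grouping step, independent of plotting (`group_small_slices` and the sample locations are hypothetical names and values used only for illustration):

```python
from collections import Counter

def group_small_slices(values, percent_limit):
    """Return (label, count) pairs, merging categories below percent_limit."""
    counts = Counter(values)
    total = sum(counts.values())
    # Categories at or above the threshold keep their own slice
    main = [(label, n) for label, n in counts.most_common()
            if 100 * n / total >= percent_limit]
    # Everything else is summed into one combined slice
    others = sum(n for n in counts.values() if 100 * n / total < percent_limit)
    if others:
        main.append((f"Others(<{percent_limit}% each)", others))
    return main

# Toy data: 60% London, 30% Paris, 10% Monaco
locations = ["London"] * 6 + ["Paris"] * 3 + ["Monaco"]
print(group_small_slices(locations, 20))
# [('London', 6), ('Paris', 3), ('Others(<20% each)', 1)]
```

Choosing the threshold is a judgment call: a low value such as 0.5% keeps many small slices, while a higher one produces a more readable chart at the cost of detail.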

## Top 10 user sources by number of tweets

In [None]:
ds = f1['source'].value_counts().reset_index()
ds.columns = ['source', 'count']
ds = ds.sort_values(['count'],ascending=False)

fig = sns.barplot(
    x=ds.head(10)["count"], 
    y=ds.head(10)["source"], 
    orient='h'
).set_title('Top 10 user sources by number of tweets')

## Total number of tweets for users and number of hashtags in every tweet

In [None]:
f1['hashtags'] = f1['hashtags'].fillna('[]')
f1['hashtags_count'] = f1['hashtags'].apply(lambda x: len(x.split(',')))
f1.loc[f1['hashtags'] == '[]', 'hashtags_count'] = 0
fig = sns.scatterplot( 
    x=f1['hashtags_count'], 
    y=f1['tweets_count']
).set_title('Total number of tweets for users and number of hashtags in every tweet')


* Users who post 100 tweets use between 1 and 33 hashtags per tweet!
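
Counting hashtags by splitting the raw string on commas works, but it relies on the string layout. Assuming the `hashtags` column stores Python-list-style strings such as `"['F1', 'HungarianGP']"` (an assumption about the dataset, not something verified here), `ast.literal_eval` can parse each value into a real list. A hedged sketch with a hypothetical helper `count_hashtags`:

```python
import ast

def count_hashtags(raw):
    """Count hashtags in a stringified list; missing or malformed input counts as 0."""
    if not isinstance(raw, str):
        return 0  # e.g. NaN for tweets without hashtags
    try:
        tags = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return 0  # not a valid Python literal
    return len(tags) if isinstance(tags, list) else 0

print(count_hashtags("['F1', 'HungarianGP']"))  # 2
print(count_hashtags("[]"))                     # 0
```

With pandas this could be applied as `f1['hashtags'].apply(count_hashtags)`, which would avoid the separate step that resets empty lists to zero.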

## Number of hashtags used in each tweet

In [None]:
ds = f1['hashtags_count'].value_counts().reset_index()
ds.columns = ['hashtags_count', 'count']
ds = ds.sort_values(['count'],ascending=False)
ds['hashtags_count'] = ds['hashtags_count'].astype(str) + ' tags'
fig = sns.barplot( 
    x=ds["count"], 
    y=ds["hashtags_count"], 
    orient='h'
).set_title('Distribution of number of hashtags in tweets')

* Most tweets contain 2 hashtags, followed by tweets with 1 hashtag
* Very few tweets contain more than 5 hashtags

## Number of unique users each day!

In [None]:
f1['date'] = pd.to_datetime(f1['date'],infer_datetime_format=True,errors ='coerce') 
df = f1.sort_values(['date'])
df['day'] = df['date'].astype(str).str.split(' ', expand=True)[0]
df['time'] = df['date'].astype(str).str.split(' ', expand=True)[1]
df.head()

ds = df.groupby(['day', 'user_name'])['hashtags_count'].count().reset_index()
ds = ds.groupby(['day'])['user_name'].count().reset_index()
ds.columns = ['day', 'number_of_users']
ds['day'] = ds['day'].astype(str)
fig = sns.barplot( 
    x=ds['day'], 
    y=ds["number_of_users"], 
    orient='v'
).set_title('Number of unique users per day')
plt.xticks(rotation=90)
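
The two chained `groupby` calls above count distinct users per day. The same result can be sketched with `nunique()`, shown here on a toy frame (the days and user names below are invented for illustration):

```python
import pandas as pd

# Hypothetical sample: alice tweets twice on the first day
toy = pd.DataFrame({
    "day": ["2021-08-01", "2021-08-01", "2021-08-01", "2021-08-02"],
    "user_name": ["alice", "alice", "bob", "alice"],
})

# nunique() counts each user once per day, however many tweets they posted
users_per_day = toy.groupby("day")["user_name"].nunique().reset_index()
users_per_day.columns = ["day", "number_of_users"]
print(users_per_day)
```

Here the first day has two unique users even though it has three tweets, which is the distinction this chart is meant to show.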

## Tweets distribution over days present in dataset

In [None]:
ds = df['day'].value_counts().reset_index()
ds.columns = ['day', 'count']
ds = ds.sort_values('count',ascending=False)
ds['day'] = ds['day'].astype(str)
fig = sns.barplot( 
    x=ds['count'], 
    y=ds["day"], 
    orient='h'
).set_title('Tweets distribution over days present in dataset')

## Tweets per day

In [None]:
f1['tweet_date']=f1['date'].dt.date
# Count tweets per calendar day; naming the columns explicitly keeps this
# independent of how value_counts() labels its output
tweet_date=f1['tweet_date'].value_counts().reset_index()
tweet_date.columns=['date','count']
tweet_date['date']=pd.to_datetime(tweet_date['date'],infer_datetime_format=True,errors ='coerce')
tweet_date=tweet_date.sort_values('date',ascending=False)

fig=go.Figure(go.Scatter(x=tweet_date['date'],
                                y=tweet_date['count'],
                               mode='markers+lines',
                               name="Tweets",
                               marker_color='dodgerblue'))

f1_dummy=f1.dropna(subset=['tweet_date'])

# Use min()/max() for the date range: sort_values()[0] looks up index label 0,
# which is not necessarily the earliest date
fig.update_layout(
    title_text='Tweets per Day : ({} - {})'.format(f1_dummy['tweet_date'].min().strftime("%d/%m/%Y"),
                                                       f1_dummy['tweet_date'].max().strftime("%d/%m/%Y")),template="plotly_dark",
    title_x=0.5)

fig.show()

## Tweet distribution - hourly

In [None]:
f1['hour'] = f1['date'].dt.hour
ds = f1['hour'].value_counts().reset_index()
ds.columns = ['hour', 'count']
ds['hour'] = 'Hour ' + ds['hour'].astype(str)
fig = sns.barplot( 
    x=ds["hour"], 
    y=ds["count"], 
    orient='v'
).set_title('Tweets distribution over hours')
plt.xticks(rotation='vertical')


## Top 10 hashtags used in the tweets!

In [None]:
def split_hashtags(x): 
    return str(x).replace('[', '').replace(']', '').split(',')

tweets_df = f1.copy()
tweets_df['hashtag'] = tweets_df['hashtags'].apply(lambda row : split_hashtags(row))
tweets_df = tweets_df.explode('hashtag')
tweets_df['hashtag'] = tweets_df['hashtag'].astype(str).str.lower().str.replace("'", '').str.replace(" ", '')
tweets_df.loc[tweets_df['hashtag']=='', 'hashtag'] = 'NO HASHTAG'
#tweets_df

ds = tweets_df['hashtag'].value_counts().reset_index()
ds.columns = ['hashtag', 'count']
ds = ds.sort_values(['count'],ascending=False)
fig = sns.barplot(
    x=ds.head(10)["count"], 
    y=ds.head(10)['hashtag'], 
    orient='h'
).set_title('Top 10 hashtags')

<a id="4"></a>
<h2 style='background:gold; border:0; color:black'><center>Tweets text analysis</center></h2>

## Prevalent words in tweets 

In [None]:
def build_wordcloud(df, title):
    # Join every tweet into one string; str(df) on a long Series would only
    # return its truncated preview rather than the full text
    text = ' '.join(df.dropna().astype(str))
    wordcloud = WordCloud(
        background_color='black',colormap="Oranges", 
        stopwords=set(STOPWORDS), 
        max_words=50, 
        max_font_size=40, 
        random_state=666
    ).generate(text)

    fig = plt.figure(1, figsize=(14,14))
    plt.axis('off')
    fig.suptitle(title, fontsize=16)
    fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
build_wordcloud(f1['text'], 'Prevalent words in tweets for all dataset')

## Prevalent words in tweets from the UK, Paris and India

In [None]:
uk_df = f1.loc[(f1.user_location=="United Kingdom")|(f1.user_location=="London, England")]
build_wordcloud(uk_df['text'], title = 'Prevalent words in tweets from UK')

In [None]:
paris_df = f1.loc[f1.user_location=="Paris"]
build_wordcloud(paris_df['text'], title = 'Prevalent words in tweets from Paris')

In [None]:
india_df = f1.loc[f1.user_location=="India"]
build_wordcloud(india_df['text'], title = 'Prevalent words in tweets from India')

## Refining the text (Important step)

In [None]:
def remove_tag(string):
    # Strip HTML tags such as <br>
    text=re.sub(r'<.*?>','',string)
    return text
def remove_mention(text):
    # Strip @mentions
    line=re.sub(r'@\w+','',text)
    return line
def remove_hash(text):
    # Strip #hashtags
    line=re.sub(r'#\w+','',text)
    return line

def remove_newline(string):
    # Replace newlines with a space so adjacent words are not glued together
    text=re.sub(r'\n',' ',string)
    return text
def remove_url(string): 
    # Strip http/https URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','',string)
    return text
def remove_number(text):
    # Strip digits
    line=re.sub(r'[0-9]+','',text)
    return line
def remove_punct(text):
    # Strip punctuation characters
    line = re.sub(r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]+','',text)
    return line
def text_strip(string):
    # Trim the ends and collapse runs of whitespace into a single space
    line=re.sub(r'\s{2,}', ' ', string.strip())
    return line
def remove_thi_amp_ha_words(string):
    # Drop leftover fragments: 'amp' (from &amp;), 'thi' and 'ha'
    line=re.sub(r'\bamp\b|\bthi\b|\bha\b',' ',string)
    return line

In [None]:
f1['refine_text']=f1['text'].str.lower()
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_tag(str(x)))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_mention(str(x)))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_hash(str(x)))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_newline(x))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_url(x))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_number(x))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_punct(x))
f1['refine_text']=f1['refine_text'].apply(lambda x:remove_thi_amp_ha_words(x))
f1['refine_text']=f1['refine_text'].apply(lambda x:text_strip(x))

f1['text_length']=f1['refine_text'].str.split().map(lambda x: len(x))
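
The cleaning steps above can be sketched as one standard-library function, which makes the order of operations easier to see. This is a simplified illustration rather than a drop-in replacement: `clean_tweet` is a hypothetical helper, its patterns are looser than the ones in the notebook, and it removes URLs first so that later steps do not cut them into fragments.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, digits and punctuation."""
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)   # URLs first, while still intact
    text = re.sub(r'@\w+', '', text)           # @mentions
    text = re.sub(r'#\w+', '', text)           # #hashtags
    text = re.sub(r'\d+', '', text)            # digits
    text = re.sub(r'[^\w\s]', '', text)        # remaining punctuation
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace

sample = "Great drive by @LewisHamilton at the #HungarianGP!! 25 points https://t.co/abc123"
print(clean_tweet(sample))
# great drive by at the points
```

Removing hashtags and mentions discards driver and race names that appear only as tags, so whether to keep them depends on what the later n-gram analysis should capture.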

## Distribution of F1 tweet length (violin plot)

In [None]:
fig = go.Figure(data=go.Violin(y=f1['text_length'], box_visible=True, line_color='black',
                               meanline_visible=True, fillcolor='royalblue', opacity=0.6,
                               x0='Tweet Text Length'))

fig.update_layout(yaxis_zeroline=False,title="Distribution of Text length",template='ggplot2')
fig.show()

* Average length of an F1 tweet: 14.36 words
* Median length of an F1 tweet: 11 words
* Interquartile range: between 6 and 19 words
* Min: 1
* Max: 58
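
Summary figures like the ones above can be computed directly rather than read off the plot. A minimal sketch with the standard `statistics` module, using made-up word counts (the `lengths` list is toy data, not the dataset's values); note that quartile conventions differ between libraries, so results may not match a plot's hover values exactly:

```python
import statistics

# Hypothetical word counts for five cleaned tweets (toy values)
lengths = [1, 6, 11, 19, 58]

mean_len = statistics.mean(lengths)
median_len = statistics.median(lengths)
# 'inclusive' treats the data as the whole population when interpolating
q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")

print(mean_len, median_len, q1, q3, min(lengths), max(lengths))
```

For the real data, the same numbers would typically come from `f1['text_length'].describe()` in pandas.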

## N-grams

## Top N-grams (word sequences) used in F1 tweets

In [None]:
def ngram_df(corpus,nrange,n=None):
    vec = CountVectorizer(stop_words = 'english',ngram_range=nrange).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    total_list=words_freq[:n]
    df=pd.DataFrame(total_list,columns=['text','count'])
    return df
unigram_df=ngram_df(f1['refine_text'],(1,1),20)
bigram_df=ngram_df(f1['refine_text'],(2,2),20)
trigram_df=ngram_df(f1['refine_text'],(3,3),20)
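
To make the idea behind the n-gram counts concrete, here is a standard-library sketch that slides a window of `n` words over each document and tallies the results. It is a simplified illustration: unlike `CountVectorizer(stop_words='english')` it tokenizes by whitespace only and does not remove stop words, and `top_ngrams` and the sample corpus are hypothetical.

```python
from collections import Counter

def top_ngrams(corpus, n, k):
    """Return the k most frequent word n-grams across all documents."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        # Slide a window of n tokens over the document
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)

corpus = [
    "lewis hamilton wins hungarian grand prix",
    "hungarian grand prix highlights",
    "red bull at the hungarian grand prix",
]
print(top_ngrams(corpus, 3, 1))
# [('hungarian grand prix', 3)]
```

Because n-grams never cross document boundaries here, the last word of one tweet is not paired with the first word of the next, which matches how a vectorizer treats each row as a separate document.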

In [None]:
fig = make_subplots(
    rows=3, cols=1,subplot_titles=("Unigram","Bigram",'Trigram'),
    specs=[[{"type": "scatter"}],
           [{"type": "scatter"}],
           [{"type": "scatter"}]
          ])

# x, y and text are all reversed together so each label matches its bar
fig.add_trace(go.Bar(
    y=unigram_df['text'][::-1],
    x=unigram_df['count'][::-1],
    marker={'color': "blue"},  
    text=unigram_df['count'][::-1],
    textposition = "outside",
    orientation="h",
    name="Unigram",
),row=1,col=1)

fig.add_trace(go.Bar(
    y=bigram_df['text'][::-1],
    x=bigram_df['count'][::-1],
    marker={'color': "blue"},  
    text=bigram_df['count'][::-1],
    name="Bigram",
    textposition = "outside",
    orientation="h",
),row=2,col=1)

fig.add_trace(go.Bar(
    y=trigram_df['text'][::-1],
    x=trigram_df['count'][::-1],
    marker={'color': "blue"},  
    text=trigram_df['count'][::-1],
    name="Trigram",
    orientation="h",
    textposition = "outside",
),row=3,col=1)

fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Top N Grams',xaxis_title=" ",yaxis_title=" ",
                  showlegend=False,title_x=0.5,height=1200,template="plotly_dark")
fig.show()

* race, hamilton, and bottas are the most used unigrams
* red bull, grand prix, and lewis hamilton are the most used bigrams in F1 tweets.
* hungarian grand prix is the most used trigram!

<h2 style='background:black; border:0; color:gold'><center>Kindly upvote the notebook and the dataset!</center></h2>

[Dataset link with around 50k+ f1 tweets](https://www.kaggle.com/kaushiksuresh147/formula-1-trending-tweets)

### Resources:

1. https://www.kaggle.com/raenish/covid19-tweets-eda-prediction/log
2. https://www.kaggle.com/isaienkov/covid19-eda-animated-geographical-distribution

<a id="4"></a>
<h2 style='background:gold; border:0; color:black'><center>Kindly upvote if you find this notebook useful! Cheers!</center></h2>