# Overview 

## The dataset consists of 14 columns:

* 'id': Id of the posted tweet
* 'created_at': Date and time the tweet was posted
* 'retweet_count': Number of times the tweet was retweeted
* 'source': Platform from which the tweet was posted
* 'user_id': Id of the user posting the tweet
* 'user_name': Name of the user posting the tweet
* 'user_description': Profile description of the user posting the tweet
* 'user_follower_count': Number of followers the user has
* 'user_friends_count': Number of friends the user has
* 'user_location': Location from which the user posted the tweet
* 'user_verified': Whether the user is verified by Twitter
* 'user_url': URL of the user's profile
* 'tweet': Text of the tweet posted by the user
* 'length_of_tweet': Total length of the tweet posted by the user (in words)

## Steps I used in this kernel:
> ### 1. Import Libraries
> ### 2. Read Files & Basic Insights
> ### 3. Preprocessing & Analysis
> ### 4. Data Visualization
> ### 5. Conclusion

# 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 2. Read Files & Basic insights

## 2.1 Read CSV Files 

In [None]:
data = pd.read_csv("../input/neet-tweets-dataset/neet_data.csv")

* First 5 Rows

In [None]:
data.head()

* Last 5 Rows

In [None]:
data.tail()

* Random 5 Rows

In [None]:
data.sample(5)

## 2.2 Basic Insights

In [None]:
print("Shape of the dataset : ", data.shape)

In [None]:
print("Column Names : \n"+'-'*25)
print(data.columns)

In [None]:
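print("Unique values in every column \n"+'-'*25)
for i in data.columns:
    # nunique(dropna=False) counts all missing values as one category;
    # len(set(...)) can count each NaN separately in numeric columns
    print("\t"+i+" = ", data[i].nunique(dropna=False))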

In [None]:
data.describe()

In [None]:
data.info()

# 3. Preprocessing & Analysis

## 3.1 Clearing Null Values

In [None]:
data.isnull().sum()

In [None]:
# Percentage of missing values per column
nullCount = (data.isna().mean() * 100).reset_index()
nullCount.columns = ["Columns", "missing value percentage"]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle("Missing Value percentage", fontsize=18)
sns.heatmap(data.isna(), ax=axes[0])
# x and y are passed by keyword; recent seaborn versions reject positional use
sns.barplot(x=nullCount['Columns'], y=nullCount['missing value percentage'], ax=axes[1])
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()

### Inference :-
#### user_url (\~61%) has the most missing values, followed by user_location (\~36%) and user_description (\~19%).
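
The percentages above can also be read off as a sorted table instead of from the chart. The sketch below shows one way to do that; the `toy` frame and its values are made up for illustration and are not the real tweet data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the tweet data.
toy = pd.DataFrame({
    "user_url": ["https://a.example", np.nan, np.nan, np.nan],
    "user_location": ["Delhi", np.nan, "Mumbai", "Pune"],
    "tweet": ["t1", "t2", "t3", "t4"],
})

# isna() gives booleans; the column-wise mean of booleans is the missing fraction.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

On the real data the same two lines would be applied to `data` instead of `toy`.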

## 3.2 Imputation And Arranging

In [None]:
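# replace missing user_location values with 'India'
# (assigning the result back avoids the chained inplace fillna, which newer pandas warns about)
data['user_location'] = data['user_location'].fillna('India')
# replace missing user_description values with 'No Description'
data['user_description'] = data['user_description'].fillna('No Description')
# check the remaining null values in every column
data.isnull().sum()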

In [None]:
data.boxplot()
plt.xticks(rotation=90)
plt.show()

* So we do not have many outliers in our data.
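
A box plot gives a visual impression only. To put a number on it, one option is to count the values outside the usual 1.5 × IQR whiskers for each numeric column. This is a minimal sketch on made-up numbers; the helper name `iqr_outlier_counts` and the `toy` frame are hypothetical and not part of the original kernel.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "retweet_count": [0, 1, 0, 2, 1, 0, 50],
    "length_of_tweet": [20, 25, 22, 30, 28, 24, 26],
})

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(data)` on the real frame would give one count per numeric column, which can be compared with the number of rows.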

In [None]:
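# Split 'created_at' into a date column and a time column.
# Note: df is another name for data (not a copy), so both see the new columns.
df = data
parts = data['created_at'].str.split(' ', expand=True)
df['created_on'] = parts[0]  # date part
df['created_at'] = parts[1]  # time part
df.head(3)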
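
Splitting on a space keeps the date and the time as plain strings. If the timestamps look like `2020-08-26 10:15:30` (an assumption, since the raw values are not shown here), `pd.to_datetime` gives real datetime values that can be sorted or grouped by hour. The `toy` frame and the `created_time` and `hour` column names below are made up for illustration.

```python
import pandas as pd

# Made-up timestamps in an assumed "YYYY-MM-DD HH:MM:SS" format.
toy = pd.DataFrame({"created_at": ["2020-08-26 10:15:30", "2020-08-27 23:59:01"]})

stamp = pd.to_datetime(toy["created_at"])
toy["created_on"] = stamp.dt.date    # datetime.date objects
toy["created_time"] = stamp.dt.time  # datetime.time objects
toy["hour"] = stamp.dt.hour          # integer hour, handy for grouping

print(toy)
```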

#### Let's drop user_url, as it has many missing values and is not very useful

In [None]:
# errors='ignore' makes the cell safe to re-run after the column is already gone
data.drop(columns='user_url', inplace=True, errors='ignore')

* Now let's look at the hashtags and the people tagged in each tweet.

In [None]:
hashtags = []
hashtags_count = []
person_tags = []
person_tags_count = []
for sen in data['tweet']:
    hashes = []
    tags = []
    for word in sen.split(' '):
        # skip empty strings and a bare '#' or '@'
        if len(word) > 1:
            if word[0] == '#':
                hashes.append(word)
            elif word[0] == '@':
                tags.append(word)
    # stored as tuples because tuples are hashable, so value_counts() works on them later
    hashtags.append(tuple(hashes))
    person_tags.append(tuple(tags))
    hashtags_count.append(len(hashes))
    person_tags_count.append(len(tags))

# sanity check: all four lists should have one entry per tweet
len(person_tags), len(hashtags), len(hashtags_count), len(person_tags_count)

In [None]:
df['tagged_persons'] = tuple(person_tags)
df['hashtags'] = tuple(hashtags)
df['hashtags_count'] = hashtags_count
df['tagged_persons_count'] = person_tags_count
df.head(5)

In [None]:
print("Our dataset has {} persons tagged".format(df['tagged_persons_count'].sum()))
print("In our dataset users used {} hashtags ".format(df['hashtags_count'].sum()))
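
The loop above keeps any space-separated word that starts with `#` or `@`, so trailing punctuation stays attached (for example `#exam.`). A regular expression is one possible alternative. The sketch below uses `Series.str.findall` on made-up tweets; note that it is a different technique from the loop and can give different results, for instance it also matches a `#` or `@` in the middle of a word.

```python
import pandas as pd

# Made-up tweets for illustration only.
toy = pd.DataFrame({"tweet": [
    "Postpone #NEET now! @someone #exam.",
    "no tags here",
    "@a_user @b_user good luck #NEET",
]})

# \w+ stops at punctuation, so "#exam." is captured as "#exam".
toy["hashtags"] = toy["tweet"].str.findall(r"#\w+")
toy["tagged_persons"] = toy["tweet"].str.findall(r"@\w+")
toy["hashtags_count"] = toy["hashtags"].str.len()
toy["tagged_persons_count"] = toy["tagged_persons"].str.len()

print(toy[["hashtags", "tagged_persons"]])
```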

In [None]:
# Keep the columns needed for the analysis, in a more readable order ('id' is dropped)
df = df[['user_id', 'user_name', 'user_description', 'user_follower_count', 'user_friends_count',
         'user_location', 'user_verified', 'tweet', 'length_of_tweet', 'retweet_count', 'source',
         'created_at', 'created_on', 'tagged_persons', 'hashtags', 'hashtags_count',
         'tagged_persons_count']]

In [None]:
df.head(3)

In [None]:
print("The new shape of our Data is : ",df.shape)

# 4. Data Visualization 

In [None]:
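# Most frequent hashtag combinations; [1:8] skips the top entry
# (presumably the empty tuple, i.e. tweets without any hashtag)
hashData = df.hashtags.value_counts()[1:8].reset_index()
# Name the columns explicitly: the defaults differ between pandas 1.x and 2.x
hashData.columns = ['hashtags', 'count']
hashData['hashtags'] = hashData['hashtags'].astype(str)  # tuples -> readable labels
fig, axes = plt.subplots(1, 1, figsize=(14, 5))
plt.suptitle("Trending Hashtags Used", fontsize=18)
sns.barplot(data=hashData, y='hashtags', x='count')
plt.show()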

In [None]:
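# Most frequent combinations of tagged persons; [1:8] skips the top entry
# (presumably the empty tuple, i.e. tweets that tag nobody)
tagData = df.tagged_persons.value_counts()[1:8].reset_index()
# Name the columns explicitly: the defaults differ between pandas 1.x and 2.x
tagData.columns = ['tagged_persons', 'count']
tagData['tagged_persons'] = tagData['tagged_persons'].astype(str)  # tuples -> readable labels
fig, axes = plt.subplots(1, 1, figsize=(14, 5))
plt.suptitle("Most Tagged Persons", fontsize=18)
sns.barplot(data=tagData, y='tagged_persons', x='count')
plt.show()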
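
The two charts above count each tweet's full combination of hashtags or mentions (the whole tuple), not the individual tags. To rank single hashtags instead, the tuples could be flattened with `Series.explode` first. This is a sketch on made-up tuples; treating `#neet` and `#NEET` as the same tag by upper-casing is an assumption, not something the original kernel does.

```python
import pandas as pd

# Made-up tuples in the same shape as the 'hashtags' column.
toy = pd.DataFrame({
    "hashtags": [("#NEET", "#JEE"), (), ("#NEET",), ("#neet", "#JEE")]
})

individual = (
    toy["hashtags"]
    .explode()       # one row per hashtag; empty tuples become NaN
    .dropna()
    .str.upper()     # assumption: hashtags are compared case-insensitively
    .value_counts()
)
print(individual)
```

The same chain would work for the `tagged_persons` column.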

In [None]:
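# Use a separate name so the tweet DataFrame df is not overwritten
locData = data.user_location.value_counts()[:3].reset_index()
# Name the columns explicitly: the defaults differ between pandas 1.x and 2.x
locData.columns = ['user_location', 'count']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle("Most Common Locations", fontsize=18)
sns.lineplot(x=locData["user_location"], y=locData["count"], ax=axes[1])
sns.barplot(y=locData["user_location"], x=locData["count"], ax=axes[0])
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()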

### Inference :-
#### Most of the users are from India and preferred not to provide more detail about their location.
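
Since missing locations were filled with 'India' in section 3.2, the 'India' count mixes reported and imputed values. One way to keep them apart is to record a flag before filling. A minimal sketch on made-up locations follows; the `location_missing` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up locations for illustration only.
toy = pd.DataFrame({
    "user_location": ["India", np.nan, "New Delhi, India", np.nan, "USA"]
})

# Remember which rows had no location before imputing.
toy["location_missing"] = toy["user_location"].isna()
toy["user_location"] = toy["user_location"].fillna("India")

all_counts = toy["user_location"].value_counts()
reported = toy.loc[~toy["location_missing"], "user_location"].value_counts()

print(all_counts)
print(reported)
```

Comparing `all_counts` with `reported` shows how much of the 'India' bar comes from imputation.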

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle("Verified Users", fontsize=18)
explode = (0.4, 0)
# x is passed by keyword; recent seaborn versions reject positional use
sns.countplot(x=data["user_verified"], ax=axes[1])
data['user_verified'].value_counts().plot.pie(explode=explode, shadow=True, startangle=90, ax=axes[0])
plt.show()

### Inference :- 
#### Very few users who are posting on #NEET are verified by Twitter.  
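
"Very few" can be backed by a percentage. The sketch below computes the share of verified and non-verified accounts with `value_counts(normalize=True)`; the values are made up and the label strings are arbitrary choices.

```python
import pandas as pd

# Made-up flags for illustration only.
toy = pd.DataFrame({"user_verified": [False, False, True, False, False]})

share = (
    toy["user_verified"]
    .map({True: "Verified", False: "Not verified"})
    .value_counts(normalize=True)  # fractions instead of raw counts
    * 100
)
print(share.round(1))
```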

In [None]:
# Label each tweet by whether it was retweeted at least once
retweeted = pd.Series(np.where(data.retweet_count > 0, 'Retweeted', 'Not Retweeted'))
uniq = data.retweet_count.unique()
uniq

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.suptitle("Retweeted Counts", fontsize=18)
retweeted.value_counts().plot.pie(explode=(0.2, 0), shadow=True, startangle=90, ax=axes[0])
# draw the second pie on the right-hand axes explicitly instead of relying on the current axes
axes[1].pie(data.retweet_count.value_counts(), startangle=30, shadow=True)
plt.show()

In [None]:
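# Use a separate name so the tweet DataFrame df is not overwritten
srcData = data.source.value_counts()[:7].reset_index()
# Name the columns explicitly: the defaults differ between pandas 1.x and 2.x
srcData.columns = ['source', 'count']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle("Common Sources Used by Users", fontsize=18)
sns.barplot(y=srcData["source"], x=srcData["count"], ax=axes[0])
sns.lineplot(x=srcData["source"], y=srcData["count"], ax=axes[1])
axes[1].tick_params(axis='x', labelrotation=90)
plt.show()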

### Inference :-
#### Most tweets were posted from Android, followed by the Twitter Web App and then the remaining sources.
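
To see how large "the rest" is, the less common sources could be lumped into a single 'Other' bucket and everything expressed as a share. This is a sketch on made-up source values; `top_n` and the 'Other' label are arbitrary choices.

```python
import pandas as pd

# Made-up source values for illustration only.
toy = pd.Series(
    ["Twitter for Android"] * 5
    + ["Twitter Web App"] * 3
    + ["Twitter for iPhone"]
    + ["TweetDeck"],
    name="source",
)

top_n = 2
counts = toy.value_counts()
top = counts.head(top_n)
other = counts.iloc[top_n:].sum()  # everything outside the top_n sources

summary = pd.concat([top, pd.Series({"Other": other})])
share = (summary / summary.sum() * 100).round(1)
print(share)
```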


# 5. Conclusion
*  The data does not have many outliers.
*  Many users leave their descriptive information blank: the descriptive columns (user_url, user_location and user_description) have a lot of missing values.
*  Most of the users are from India, with a few from other countries.
*  Most of the users posting on #NEET are not verified by Twitter.
*  People mostly post from Android devices, followed by the Twitter Web App and then the remaining applications.
*  The tweets contain 203 tagged persons and 62 hashtags.
*  The most tagged person is @neet_gill and the most used hashtag is NEET.

### References: 
#### https://www.kaggle.com/sudarshanpatil/ipl-tweets-eda
#### https://seaborn.pydata.org/
#### https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
