# Assignment \#9: the Sentiment140 dataset

The Sentiment140 dataset contains 1,600,000 tweets extracted using the Twitter API from April to June 2009. The tweets are annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

Source: kaggle.com

Because the full file is quite large (230 MB), we use a subset of this dataset that keeps only the tweets whose text contains the string `"good"` or `"bad"` (as substrings, not only as isolated words). The subset for this assignment contains 103,259 tweets, with the following columns:

- date: datetime of the tweet in the format: year-month-day hours:minutes:seconds
- score: polarity of the tweet: 0=negative and 4=positive
- user: user who produced the tweet
- text: text of the tweet

**Caution**: Functions that return a floating point number (ratio, mean, percentage) should round it to 1 decimal place:
- Such questions are marked with `(°)`
- For instance, if the variable `result` holds a floating point number such as `3.14159265359`, the function should return `round(result, 1)`, i.e. `3.1`, instead of `result`
- Sometimes, rounding a floating point number to 1 decimal place leads to a strange number such as `3.100000001`. This is a normal consequence of floating point representation and will not affect the grade.
- Percentages should be returned as floating point numbers (not with the % mark).
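
As a minimal illustration of the rounding rule above (the helper name `mean_rounded` and the sample values are made up for this sketch and are not part of the assignment):

```python
def mean_rounded(values):
    """Return the mean of `values`, rounded to 1 decimal place."""
    result = sum(values) / len(values)
    return round(result, 1)


print(round(3.14159265359, 1))   # 3.1
print(mean_rounded([0, 4, 4]))   # 2.7
```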

#### Questions

**A. In this part, we explore the time series aspect of the dataset**

1) Take the first datetime of the dataset and produce a string in the exact format `day/month/year`, for example `"25/12/2020"`, using the `strftime()` method.

2) Take the last datetime of the dataset and produce a string in the exact format `month day, year`, for example `"Dec 25, 2020"`, using the `strftime()` method.

3) How many days with at least one tweet do we have?

4) What is the maximum number of tweets in a day?

5) What is the minimum number of tweets in a day with at least one tweet?

6) What is the average number of tweets in a day with at least one tweet (°)?

7) If we consider only the hours of the day (0 to 23) where the tweets have been produced, at what hour do we have the minimum number of tweets?

8) If we consider only the hours of the day (0 to 23) where the tweets have been produced, at what hour do we have the maximum number of tweets?

9) Who is the most active user in the dataset?

10) What is the average number of tweets in a day produced by the most active user (°)?


(°) Result of functions should be rounded to 1 decimal place.
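
The two date formats requested in questions 1 and 2 can be sketched on a standalone `datetime` value. The value below is made up to match the examples in the questions; it does not come from the dataset. Note that `%b` produces the abbreviated month name of the current locale, which is English by default.

```python
from datetime import datetime

moment = datetime(2020, 12, 25, 18, 30, 0)  # made-up example value

# %d = zero-padded day, %m = zero-padded month, %Y = 4-digit year
day_month_year = moment.strftime("%d/%m/%Y")
# %b = abbreviated month name (locale-dependent, English by default)
month_day_year = moment.strftime("%b %d, %Y")

print(day_month_year)   # 25/12/2020
print(month_day_year)   # Dec 25, 2020
```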

**B. In this part, we explore the textual aspect of the dataset**

11) What is the mean score of all tweets (°)?

12) What is the mean score of tweets containing the string `"good"` (°)?

13) What is the mean score of tweets containing the string `"bad"` (°)?

14) What is the mean score of tweets issued by the most active user found in question 9 (°)?

15) Which text is the most frequent one in all tweets?

16) How many different users issued the most frequent tweet?

17) What is the mean score of all tweets issued by those users (°)?

18) In tweets, users are quoted with a string that starts with an `@`, followed by uppercase or lowercase letters, digits, and underscores `_`. Which user is the most quoted one (the result should be a string starting with an `@`)?

19) How many different users issued at least one tweet quoting the most quoted user?

20) What is the mean score of tweets quoting the most quoted user (°)?

(°) Result of functions should be rounded to 1 decimal place.
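
The definition of a quoted user given in question 18 maps naturally onto a regular expression. The sketch below is only one possible approach, shown on made-up tweets (the handles `@alice`, `@bob_99` and `@Carol` are hypothetical); it counts every occurrence of each quoted user with `collections.Counter`.

```python
import re
from collections import Counter

# Made-up tweets, for illustration only
tweets = [
    "@alice good morning",
    "too bad @bob_99 and @alice could not come",
    "good job @Carol!",
]

# An '@' followed by one or more letters, digits or underscores
mention = re.compile(r"@[A-Za-z0-9_]+")

# Count every mention found in every tweet
counts = Counter(m for text in tweets for m in mention.findall(text))
most_quoted, n = counts.most_common(1)[0]

print(most_quoted, n)   # @alice 2
```

Handles are counted case-sensitively here; whether `@Alice` and `@alice` should be merged is a choice the questions above leave open.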

**Important import**

The cell below imports only the `pandas` module.

To complete this assignment, you will need to import other modules. To avoid runtime errors during grading, please import the supplementary modules that you need in the cell below.

Do not import the `locale` module. All written outputs are supposed to be in English.

In [1]:
# import
import pandas as pd
import datetime
import re
from collections import Counter
# Import here the supplementary modules that you need
# import

In [2]:
# loading the data
# the dates are parsed!
df = pd.read_csv('sample1600000.csv', parse_dates=['date'])
df.head()

Unnamed: 0,date,score,user,text
0,2009-04-06 22:21:27,0,crosland_12,@cocomix04 ill tell ya the story later not a ...
1,2009-04-06 22:23:09,0,ericg622,I had such a nice day. Too bad the rain comes ...
2,2009-04-06 22:23:15,0,adri_mane,@Starrbby too bad I won't be around I lost my ...
3,2009-04-06 22:26:29,0,timdonnelly,"hey, I actually won one of my bracket pools! T..."
4,2009-04-06 22:26:33,0,willy_chaz,A bad nite for the favorite teams: Astros and ...


In [3]:
# Take the first datetime of the dataset and produce a string in the exact format "25/12/2020"
def exercise_01():
    # iloc[0] selects the first row by position, whatever the index labels are
    first = df['date'].iloc[0]
    result = first.strftime("%d/%m/%Y")
    return result

In [4]:
# run and check
exercise_01()

'06/04/2009'

In [5]:
# Pick up the last datetime of the dataset and produce a string in the exact format "Dec 25, 2019"
def exercise_02():
    column = df['date']
    result = column.iloc[-1].strftime("%b %d, %Y")
    return result

In [6]:
# run and check
exercise_02()

'Jun 16, 2009'

In [7]:
# How many days with at least one tweet do we have?
def exercise_03():
    df2 = df.set_index('date')
    result = (df2.resample('D').size() >= 1).sum()
    return result

In [8]:
# run and check
exercise_03()

48
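
A small sketch of why the daily counts need filtering: `resample('D')` creates one bin for every calendar day between the first and the last timestamp, including days without any tweet. The toy data below is made up for illustration and assumes nothing about the real dataset.

```python
import pandas as pd

# Made-up data: two tweets on 1 April, none on 2 April, one on 3 April
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-04-01 10:00:00",
        "2009-04-01 23:59:59",
        "2009-04-03 08:15:00",
    ]),
    "text": ["good", "bad", "good"],
})

per_day = toy.set_index("date").resample("D").size()
print(per_day.tolist())           # [2, 0, 1]  (the empty day is included)
print(int((per_day > 0).sum()))   # 2 days with at least one tweet

# Alternative: count the distinct calendar days directly
print(toy["date"].dt.normalize().nunique())   # 2
```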

In [9]:
# What is the maximum number of tweets in a day?
def exercise_04():
    df2 = df.set_index('date')
    result = df2.resample('D').size().max()
    return result

In [10]:
# run and check
exercise_04()

7309

In [11]:
# What is the minimum number of tweets in a day with at least one tweet?
def exercise_05():
    df2 = df.set_index('date')
    res = df2.resample('D').size()
    res2 = res[res != 0]
    result = res2.min()
    return result

In [12]:
# run and check
exercise_05()

10

In [13]:
# What is the average number of tweets in a day with at least one tweet (°)?
def exercise_06():
    df2 = df.set_index('date')
    res = df2.resample('D').size()
    res2 = res[res != 0]
    result = round(res2.mean(),1)
    return result

In [14]:
# run and check
exercise_06()

2151.2

In [15]:
# At what hour do we have the minimum number of tweets?
def exercise_07():
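    # Count tweets per hour of the day (0 to 23) and take the least frequent hour.
    # Return the value itself: print() returns None, so it must not be assigned.
    result = int(df['date'].dt.hour.value_counts().idxmin())
    return result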

In [16]:
# run and check
exercise_07()

13


In [17]:
# At what hour do we have the maximum number of tweets?
def exercise_08():
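    # Count tweets per hour of the day (0 to 23) and take the most frequent hour.
    # Return the value itself: print() returns None, so it must not be assigned.
    result = int(df['date'].dt.hour.value_counts().idxmax())
    return result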

In [18]:
# run and check
exercise_08()

23
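
As a hedged illustration of the hour-of-day counting used in questions 7 and 8, the sketch below applies the `.dt.hour` accessor to a few made-up timestamps. Hours without any tweet simply do not appear in the counts, which matches the "only the hours where tweets have been produced" wording.

```python
import pandas as pd

# Made-up timestamps, for illustration only
dates = pd.Series(pd.to_datetime([
    "2009-04-06 22:21:27",
    "2009-04-06 22:23:09",
    "2009-04-07 07:05:00",
]))

# .dt.hour extracts the hour as an integer between 0 and 23
per_hour = dates.dt.hour.value_counts()

print(per_hour.idxmax())   # 22
print(per_hour.idxmin())   # 7
```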


In [19]:
# Who is the most active user in the dataset?
def exercise_09():
    result = df['user'].value_counts().idxmax()
    return result

In [20]:
# run and check
exercise_09()

'VioletsCRUK'

In [21]:
# What is the average number of tweets in a day produced by the most active user (°)?
def exercise_10():
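    # Reuse the answer to question 9 instead of hardcoding the user name
    top_user = exercise_09()
    per_day = df.loc[df['user'] == top_user].set_index('date').resample('D').size()
    # Keep only the days on which this user tweeted at least once
    active_days = per_day[per_day != 0]
    result = round(active_days.mean(), 1)
    return result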

In [22]:
# run and check
exercise_10()

2.6

In [23]:
# What is the mean score of all tweets (°)?
def exercise_11():
    result = round(df['score'].mean(), 1)
    return result

In [24]:
# run and check
exercise_11()

2.2

In [25]:
# What is the mean score of tweets containing the string "good" (°)?
def exercise_12():
    tweets = df.loc[df['text'].str.contains('good'), 'score']
    result = round(tweets.mean(),1)
    return result

In [26]:
# run and check
exercise_12()

2.6

In [27]:
# What is the mean score of tweets containing the string "bad" (°)?
def exercise_13():
    tweets = df.loc[df['text'].str.contains('bad'), 'score']
    result = round(tweets.mean(),1)
    return result

In [28]:
# run and check
exercise_13()

0.9

In [29]:
# What is the mean score of tweets issued by the most active user found in question 9 (°)?
def exercise_14():
    # Reuse the answer to question 9 instead of hardcoding the user name
    scores = df.loc[df['user'] == exercise_09(), 'score']
    result = round(scores.mean(), 1)
    return result

In [30]:
# run and check
exercise_14()

3.1

In [31]:
# Which text is the most frequent one in all tweets?
def exercise_15():
    result = df['text'].value_counts().idxmax()
    return result

In [32]:
# run and check
exercise_15()

'good morning '

In [33]:
# How many different users issued the most frequent tweet?
def exercise_16():
    # Reuse the answer to question 15 instead of hardcoding the text
    result = df.loc[df['text'] == exercise_15(), 'user'].nunique()
    return result

In [34]:
# run and check
exercise_16()

114

In [35]:
# What is the mean score of all tweets issued by those users (°)?
def exercise_17():
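    # Users who posted the most frequent text (answer to question 15)
    users = df.loc[df['text'] == exercise_15(), 'user'].unique()
    # All tweets issued by those users, whatever their text
    tweets = df.loc[df['user'].isin(users)]
    result = round(tweets['score'].mean(), 1)
    return result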

In [36]:
# run and check
exercise_17()

3.7

In [37]:
# Which user is the most quoted one (the result should be a string starting with an `@`)?
def exercise_18():
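    # Extract every quoted user ('@' followed by letters, digits or underscores)
    # from each tweet, wherever it appears in the text
    mentions = df['text'].str.findall(r'@[A-Za-z0-9_]+').explode()
    # Count all occurrences across the dataset and keep the most frequent one
    result = mentions.value_counts().idxmax()
    return result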

In [38]:
# run and check
exercise_18()

'@mileycyrus'

In [39]:
# How many different users issued at least a tweet quoting the most quoted user?
def exercise_19():
    # Reuse the answer to question 18 instead of hardcoding the user name
    pattern = exercise_18()
    dfa = df.loc[df['text'].str.contains(pattern)]
    result = dfa['user'].nunique()
    return result

In [40]:
# run and check
exercise_19()

307
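
One caveat with plain substring matching is that a handle can also match a longer handle that starts with the same characters. The sketch below, on made-up tweets with hypothetical handles, compares a substring test with a regular expression that refuses a match when the handle is immediately followed by another letter, digit or underscore. It is an optional refinement under that assumption, not the method required by the assignment.

```python
import re

# Made-up tweets and handles, for illustration only
tweets = [
    "@alice good idea",
    "@alice_fan not bad",
    "so good, right @alice?",
]
handle = "@alice"

# Negative lookahead: the handle must not continue with a handle character
exact = re.compile(re.escape(handle) + r"(?![A-Za-z0-9_])")

substring_hits = [t for t in tweets if handle in t]
exact_hits = [t for t in tweets if exact.search(t)]

print(len(substring_hits))   # 3 (also matches "@alice_fan")
print(len(exact_hits))       # 2
```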

In [41]:
# What is the mean score of tweets quoting the most quoted user (°)?
def exercise_20():
    # Reuse the answer to question 18 instead of hardcoding the user name
    pattern = exercise_18()
    dfa = df.loc[df['text'].str.contains(pattern)]
    result = round(dfa['score'].mean(),1)
    return result

In [42]:
# run and check
exercise_20()

3.0