# Project: What Twitter users tweet about

## Introduction

The purpose of this project is to examine and draw conclusions about __which__ events from around the world seem to __concern__ Twitter users. To do so, we are going to use the dataset of Swiss tweets and the Twitter-leon dataset of global tweets. However, we first need to determine the context in which this analysis will take place. In other words, we need to define what the two words in bold in the first sentence mean in the context of this project:

1. The word __which__ refers to the types of events that interest Twitter users. An event could be anything from a festival to a scientific discovery, or even a terrorist attack. It is therefore important that we define the broader <u>categories</u> of events whose "appearances" in the Twitter dataset we are going to examine, as well as the <u>specific events</u> that we can extract from the dataset and sort under these broader categories.

2. In order to determine whether an event "__concerns__" Twitter users, we are going to define a few <u>metrics</u>. Some of these metrics could be the number of tweets that refer to an event (either by using specific hashtags or by containing specific keywords that we will define), the number of retweets of a tweet that refers to a specific event, the number of days after an event during which tweets referring to it keep appearing, etc.

From all the above, we can see that, before we dive into the core part of the project, we first need to explore and analyse our datasets so that we can answer these two questions adequately, within the possibilities and constraints imposed by the datasets.
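
As a rough illustration of the metrics listed in item 2, the sketch below computes them with pandas on a tiny made-up sample. The sample tweets, the `#examplefest` keywords, the event date, the helper name `event_metrics`, and the convention that a retweet is a tweet whose text starts with `RT @` are all assumptions made for illustration, not decisions of this project.

```python
import re
import pandas as pd

# Made-up sample tweets; the real input would be one of the two datasets.
tweets = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2015-07-01 20:00", "2015-07-02 09:30", "2015-07-02 10:00",
        "2015-07-04 08:15", "2015-07-08 12:00",
    ]),
    "tweet_text": [
        "Having a great time at #ExampleFest",
        "RT @someone: best night ever #examplefest",
        "Great coffee this morning",
        "Still thinking about the Example Festival",
        "RT @other: new phone released today",
    ],
})

def event_metrics(df, keywords, event_date):
    """Compute simple interest metrics for one event (illustrative only)."""
    # Case-insensitive match on any of the hashtags or keywords.
    pattern = "|".join(re.escape(k) for k in keywords)
    matching = df[df["tweet_text"].str.contains(pattern, case=False, regex=True)]
    # Assumption: a retweet is a tweet whose text starts with "RT @".
    n_retweets = int(matching["tweet_text"].str.startswith("RT @").sum())
    # Days between the event and the last matching tweet.
    if matching.empty:
        days_active = 0
    else:
        last_day = matching["datetime"].max().normalize()
        days_active = (last_day - pd.Timestamp(event_date)).days
    return {
        "n_tweets": len(matching),
        "n_retweets": n_retweets,
        "days_active": days_active,
    }

metrics = event_metrics(tweets, ["#examplefest", "example festival"], "2015-07-01")
print(metrics)
```

Note that this counts retweets among the matching tweets, which is only one possible reading of the retweet metric; counting how often each individual tweet was retweeted would need information that may not be present in the data.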

## Part I: Determine the datasets' properties

### Chapter 1: Swiss tweets dataset

In this chapter, we will take a first look at the Swiss tweets dataset, in order to determine the dataset schema, the meaning of the columns, the number of rows and any other information that we can retrieve from a first, superficial analysis. After running a few Python scripts on the cluster (all of which can be found in the project GitHub folder), we retrieved the following information about the dataset:

1) The dataset's schema is the following:

For the purposes of this analysis, we chose to use only the following columns:

2) The total number of rows in the dataset is ...

3) The dataset contains tweets written in ... different languages, with the following distribution of tweets per language:

4) The dataset contains tweets from ...-...-2016 up until ...-...-2016

5) Since, from our point of view, the data in this dataset are effectively "found data" (meaning that we were not the ones who collected them), we want to make sure that the dataset is reliable (unbiased) enough to lead to valid conclusions about the events that concern Swiss Twitter users. To assess the dataset's reliability, we are going to use the __Bootstrap resampling__ method that we discussed in the course lectures and examine the distribution of tweets per user in each sample, in order to spot possible biases in the data.
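
One possible way to set up this check is sketched below. It assumes that we resample tweets (rows) with replacement and summarise the tweets-per-user distribution of each resample with a few statistics; the function name `bootstrap_tweets_per_user`, the chosen statistics, and the synthetic usernames (one very active account among many occasional ones) are illustrative assumptions, not the exact procedure from the lectures.

```python
import numpy as np
import pandas as pd

def bootstrap_tweets_per_user(usernames, n_samples=200, seed=0):
    """Resample tweets with replacement and summarise tweets per user each time."""
    rng = np.random.default_rng(seed)
    usernames = np.asarray(usernames)
    rows = []
    for _ in range(n_samples):
        # One bootstrap sample: same size as the data, drawn with replacement.
        sample = rng.choice(usernames, size=len(usernames), replace=True)
        counts = pd.Series(sample).value_counts()
        rows.append({
            "mean_tweets_per_user": counts.mean(),
            "median_tweets_per_user": counts.median(),
            # Share of all tweets written by the single most active user.
            "top_user_share": counts.max() / len(sample),
        })
    return pd.DataFrame(rows)

# Synthetic data: one account writes 300 of the 1000 tweets.
usernames = ["very_active_account"] * 300 + [f"user{i}" for i in range(700)]
result = bootstrap_tweets_per_user(usernames)

# Spread of each statistic across the bootstrap samples (95% interval).
print(result.quantile([0.025, 0.975]))
```

If a statistic such as `top_user_share` is consistently large across the resamples, a handful of accounts dominate the data, which is the kind of bias this step is meant to reveal.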

### Chapter 2: Global tweets dataset

Let us now take a first glance at the global tweets dataset (tweets-leon), in the same way that we did in chapter 1:

1) The dataset's schema is the following:

In [1]:
import os
import pandas as pd

DATA_DIR = '.'

# The file is tab-separated and has no header row, so the column names are supplied here.
data = pd.read_csv(os.path.join(DATA_DIR, 'head.csv'),
                   delimiter='\t',
                   header=None,
                   names=['language', 'tweet_id', 'datetime', 'username', 'tweet_text'])
data.head(5)

|   | language | tweet_id | datetime | username | tweet_text |
|---|---|---|---|---|---|
| 0 | en | 345963923251539968 | Sat Jun 15 18:00:01 +0000 2013 | Letataleta | RT @silsilfani: the world is not a wish-granti... |
| 1 | en | 345963923297673217 | Sat Jun 15 18:00:01 +0000 2013 | JamesonN7 | RT @WhosThisHoe: I'd rather sleep with a nice ... |
| 2 | en | 345963923259924480 | Sat Jun 15 18:00:01 +0000 2013 | LauraEllynJones | Can't stand people who lie then blame it on so... |
| 3 | it | 345963923276697601 | Sat Jun 15 18:00:01 +0000 2013 | ChialettaFClub | @ChialettaFClub: #rt seguimi ti seguo ti voto ... |
| 4 | fr | 345963923255730176 | Sat Jun 15 18:00:01 +0000 2013 | _irem61_ | RT @DHC_Music: Terrorism ... #FreePalestina ht... |


The column headers are not included in the dataset, but it is easy to infer what each column represents just by looking at the first 5 rows.

2) The total number of rows in the dataset is __~18 billion__. However, while inspecting the rows, we found around 8 million rows with fewer than the 5 columns visible in the schema above. Looking into these problematic rows further, we discovered that they can have any number of columns between 0 and 4. In addition, there are cases in which a single tweet spans multiple rows that contain no column other than the tweet_text. As we will see below, our analysis relies on the number of tweets referring to a specific event, so these problematic rows could lead to counting the same tweet multiple times and obtaining unrealistic results. For this reason, we decided to remove these rows from our analysis, also considering that 8 million out of 18 billion total rows is a rather insignificant percentage (about 0.04%).
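
The filtering rule described above can be sketched in plain Python as follows. This is a minimal illustration, not the script that was run on the cluster: the helper name `is_well_formed` and the malformed sample lines are made up, and the rule simply keeps lines that split into exactly 5 tab-separated fields.

```python
EXPECTED_COLUMNS = 5  # language, tweet_id, datetime, username, tweet_text

def is_well_formed(line, n_columns=EXPECTED_COLUMNS):
    """Return True if the line splits into exactly n_columns tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) == n_columns

lines = [
    # Well-formed row taken from the first rows of the dataset.
    "en\t345963923251539968\tSat Jun 15 18:00:01 +0000 2013\tLetataleta\t"
    "RT @silsilfani: the world is not a wish-granti...\n",
    # Hypothetical malformed rows: a text-only continuation, an empty line,
    # and a row with only two columns.
    "second line of a tweet that was split across rows\n",
    "\n",
    "en\t345963923297673217\n",
    # Another well-formed row.
    "fr\t345963923255730176\tSat Jun 15 18:00:01 +0000 2013\t_irem61_\t"
    "RT @DHC_Music: Terrorism ... #FreePalestina ht...\n",
]

clean = [line for line in lines if is_well_formed(line)]
print(f"kept {len(clean)} of {len(lines)} rows")

# Share of rows dropped in the full dataset, using the rounded figures from the text.
dropped_share = 8e6 / 18e9 * 100
print(f"dropped share: {dropped_share:.2f}%")
```

A rule like this would also drop any tweet whose text itself contains a tab character, so it may slightly overestimate the number of malformed rows.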

3) The dataset contains tweets written in 6 different languages, with the following distribution of tweets per language:

In [2]:
tweets_per_lang = pd.read_csv(os.path.join(DATA_DIR, 'tweets_per_language.csv'),
                              delimiter=',',
                              index_col=0)
tweets_per_lang

| language | count |
|---|---:|
| french | 676 529 769 |
| dutch | 452 780 443 |
| italian | 466 666 820 |
| german | 452 126 737 |
| english | 12 488 903 036 |
| spanish | 3 439 067 021 |
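
To compare the languages more easily, the counts can be turned into shares of the total. The sketch below assumes that the counts are stored as strings with spaces as thousands separators, as they appear in the output above, so they are converted to integers first; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the output above, kept as strings with space separators.
raw_counts = {
    "french": "676 529 769",
    "dutch": "452 780 443",
    "italian": "466 666 820",
    "german": "452 126 737",
    "english": "12 488 903 036",
    "spanish": "3 439 067 021",
}

# Remove the separators; int64 is needed because the counts exceed the 32-bit range.
counts = pd.Series(raw_counts).str.replace(" ", "", regex=False).astype("int64")

total = int(counts.sum())
share_percent = (counts / total * 100).round(2).sort_values(ascending=False)

print(f"total tweets: {total:,}")
print(share_percent)
```

The total obtained this way is consistent with the figure of roughly 18 billion rows given in point 2.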


4) The dataset contains tweets from 15-06-2013 18:00 GMT up until 01-02-2016 00:00 GMT.
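
A date range like this can be obtained by parsing the `datetime` column, which uses the format visible in the first rows of the dataset (for example `Sat Jun 15 18:00:01 +0000 2013`). The sketch below is one way to do it with the standard library; the format string is inferred from those rows, and the second timestamp is made up for illustration.

```python
from datetime import datetime, timezone

# Format inferred from the sample rows: weekday, month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_twitter_datetime(value):
    """Parse a timestamp such as 'Sat Jun 15 18:00:01 +0000 2013'."""
    return datetime.strptime(value, TWITTER_FORMAT)

stamps = [
    "Sat Jun 15 18:00:01 +0000 2013",  # taken from the dataset's first rows
    "Sun Jun 16 09:30:00 +0000 2013",  # hypothetical later tweet
]

parsed = [parse_twitter_datetime(s) for s in stamps]
first, last = min(parsed), max(parsed)

# Report the range in the day-month-year notation used in this report.
print(first.strftime("%d-%m-%Y %H:%M"), "to", last.strftime("%d-%m-%Y %H:%M"))
```

On the full dataset the minimum and maximum would be computed in a distributed fashion rather than from an in-memory list, but the parsing step would be the same.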

5) Once again, we are interested in making sure that our dataset is reliable, therefore we are going to use the __Bootstrap resampling__ method and compare the distribution of tweets per user of each "sub-sample" to one another.

## Part II: Determining the event categories and specific events to be used

Now that we know a little more about our two datasets, we are going to describe the types of events we would like to find in them and see whether we can indeed spot them. Our main idea is to divide our events into categories and then look for specific events within each category. Let's first determine the event categories we would like to use.

We plan to sort the events into 5 broad categories:

1. Political events <br/>
2. Sports events <br/>
3. Social crises (terrorist attacks, disease outbreaks, etc.) <br/>
4. Science and technology <br/>
5. Showbiz and viral trends (e.g. trending TV series, video games, etc.) <br/>



## Part III: Determining the metrics to be used

### Chapter 1: Swiss tweets dataset

### Chapter 2: Global tweets dataset

<u>Note:</u> As we saw in Part I, the tweets span from mid-June 2013 up until the very beginning of 2016. Therefore, in order to be sure of the validity of our results, we are only going to examine events that happened before

## Part IV: Dataset analysis and conclusions