# Introduction to Data Analysis using Pandas

In this tutorial, we will learn how to do a simple data analysis using pandas.

**pandas** is a Python library that makes it easy to work with structured data—like what you’d see in Excel spreadsheets or SQL tables.

In the pandas universe, a table is called a dataframe. It consists of rows and columns. 
You can create a dataframe from lists, dicts, or other data types. Usually, however, you create one from data you have collected. 
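As a minimal sketch of creating a dataframe by hand, the example below builds the same small table twice, once from a dict of columns and once from a list of row dicts. The usernames and counts are made-up values, not taken from the dataset.

```python
import pandas as pd

# From a dict: keys become column names, values become the column data.
# (All values here are invented for illustration.)
tweets_from_dict = pd.DataFrame({
    "username": ["alice", "bob", "carol"],
    "like_count": [3, 0, 12],
})

# From a list of dicts: each dict is one row.
tweets_from_rows = pd.DataFrame([
    {"username": "alice", "like_count": 3},
    {"username": "bob", "like_count": 0},
    {"username": "carol", "like_count": 12},
])

print(tweets_from_dict.shape)                     # (3, 2): 3 rows, 2 columns
print(tweets_from_dict.equals(tweets_from_rows))  # True: both routes give the same table
```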

You can load CSVs (comma-separated values), JSON files, and Excel files (xlsx). 
You can compress CSV and JSON files using bzip2 (so they become .csv.bz2 or .json.bz2) and still load them as usual. pandas decompresses them on the fly without modifying the files.
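To illustrate the compression handling, here is a small round trip that writes a toy dataframe to a `.csv.bz2` file in a temporary folder and reads it back. The file name and contents are hypothetical; the sketch relies on pandas inferring the compression from the file extension.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, only used to demonstrate the round trip.
toy = pd.DataFrame({"id": [1, 2, 3], "content": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.bz2")
    toy.to_csv(path, index=False)  # compression is inferred from the ".bz2" extension
    loaded = pd.read_csv(path)     # decompression is inferred the same way

print(loaded.shape)                # (3, 2)
print(loaded["content"].tolist())  # ['a', 'b', 'c']
```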

We will use a Twitter (now X) dataset that consists of 500k tweets related to ChatGPT between January and March 2023. [Link](https://www.kaggle.com/datasets/khalidryder777/500k-chatgpt-tweets-jan-mar-2023)

Let's load our sample data using the `read_csv()` function.

As the data consists of 500,000 entries and is 117 MB when uncompressed, it is better to use a sample for learning and exploration purposes. You can first load the data and then take a random sample. However, in a scenario where the data is massive (more than a gigabyte), it is more practical to load only the first n rows using the `nrows` argument of the `read_csv` function.
We load the first 10000 rows. 
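The two options can be compared on a toy CSV held in memory. This is only a sketch: the column names and the 100 generated rows are invented, and `random_state=42` is an arbitrary seed chosen to make the sample reproducible.

```python
import io

import pandas as pd

# A made-up CSV with 100 rows, kept in memory instead of on disk.
csv_text = "id,like_count\n" + "\n".join(f"{i},{i % 3}" for i in range(100))

# Option 1: parse only the first 10 rows (cheap, but not random).
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=10)

# Option 2: parse everything, then draw 10 random rows.
full = pd.read_csv(io.StringIO(csv_text))
random_rows = full.sample(n=10, random_state=42)

print(first_rows["id"].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(full), len(random_rows))
```

Note that `nrows` always returns the top of the file, so if the file is sorted (as ours is, by date), the result is not representative of the whole dataset.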

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/twitter_data_chatgpt.csv.bz2', nrows = 10000) # use index_col to specify an index column
df

Every dataframe has an index, which labels each row (ideally uniquely) and plays an important role in data alignment and in operations like selection, joining, and reshaping (which we will see later).

By default, if you don't specify an index, pandas assigns one automatically: a sequence of integers from 0 to n-1, where n is the number of rows. (This is called a RangeIndex.)

In [None]:
df.index

You can specify the index explicitly with the `set_index` function (if you have not done so while reading the CSV). 

An index should be unique, i.e., each label should map to a single row. Twitter assigns a unique id to each tweet, which makes it an excellent candidate for an index.


In [None]:
df.set_index("id", inplace=True)
df

Note that some pandas operations, like `set_index`, return a new (modified) dataframe instead of changing the original.
You then need to assign the result to a variable, e.g., `df = df.set_index("id")`.
Alternatively, you can pass `inplace=True` to modify the dataframe directly. It does not matter which one you choose, but it is better to be consistent. We will use the latter approach.
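The difference between the two styles can be seen on a toy dataframe (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"id": [10, 20], "content": ["a", "b"]})

# Without inplace: a new dataframe is returned and toy is left untouched.
returned = toy.set_index("id")
print(list(toy.columns))       # ['id', 'content']
print(list(returned.columns))  # ['content']

# With inplace=True: toy itself is modified and nothing useful is returned.
toy.set_index("id", inplace=True)
print(list(toy.columns))       # ['content']
```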

You can use the index to access certain rows. Use `.loc` for that.

In [None]:
df.loc['1641213230730051584'] # raises a KeyError: the label is passed as a string

Wait, why did this not work? Because the ids are stored as integers, not strings :)

In [None]:
df.loc[1641213003260633088]

You can access multiple rows using a list

In [None]:
ids = [1641213003260633088, 1641212975012016128, 1641213230730051584]
df.loc[ids]

You can also access rows by their numeric position. However, this is not recommended because the numeric position of a row may change if you sort or shuffle the data, whereas its index label stays the same.
For that, use `.iloc` (stands for "integer location").

In [None]:
df.iloc[0] #accesses the first row
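A short sketch of why positions are fragile: after sorting a toy dataframe (made-up ids and counts), `.loc` still finds the same row, while `.iloc[0]` now points at a different one.

```python
import pandas as pd

toy = pd.DataFrame(
    {"like_count": [5, 0, 9]},
    index=pd.Index([101, 102, 103], name="id"),
)
sorted_toy = toy.sort_values("like_count", ascending=False)

# Label-based access: id 101 is the same row before and after sorting.
print(toy.loc[101, "like_count"], sorted_toy.loc[101, "like_count"])  # 5 5

# Position-based access: "the first row" is a different tweet after sorting.
print(toy.iloc[0]["like_count"], sorted_toy.iloc[0]["like_count"])    # 5 9
```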

Use double brackets to get a dataframe with one or more columns.
Note: using single brackets to access a single column returns a "Series" instead.

In [None]:
df[['content']]

In [None]:
df[['date', 'content']]
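The difference between the two bracket styles is easiest to see by checking the type of the result. The toy data below is invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-02"],
    "content": ["hi", "hello"],
})

as_series = toy["content"]   # single brackets: one column as a Series
as_frame = toy[["content"]]  # double brackets: a dataframe with one column

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```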

Note that once you set the id to be an index, it is no longer a column.

In [None]:
df[['id']] # will throw a KeyError

#### Quick Exploration
Here are some useful quick-exploration functions

`head()` shows the first rows. It defaults to 5, but you can pass a different number. 

In [None]:
df.head() # first 5 rows

Entering the dataframe variable on its own displays a truncated view: for a large dataframe like ours, pandas shows the first 5 and the last 5 rows by default.

In [None]:
df

`tail()` does the same but returns the last rows (defaults to 5).

In [None]:
df.tail() # last rows

Calling len() on a dataframe shows its number of rows.

In [None]:
len(df)

Shape is an attribute of a dataframe that holds information related to its size.
It is a tuple. The first element shows the number of rows. The second element shows the number of columns. 
Our dataframe is 10000 x 5 so:

In [None]:
df.shape # notice that we do not use parentheses, as shape is an attribute, not a function

The `columns` attribute holds the column names.

In [None]:
df.columns

Note that this is a pandas Index object. You can convert it to a list, however.

In [None]:
df.columns.to_list() # or list(df.columns)

info() prints a summary of the dataframe: the index type, the column names, the number of non-null values per column, the dtypes, and the memory usage.

In [None]:
df.info()

describe() provides summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns.

In [None]:
df.describe()

value_counts() counts how often each distinct value occurs and returns the counts **sorted** from most to least frequent. It is rarely meaningful to call it on an entire dataframe. However, it is *very useful* for looking at the values in a single column. 

Let's see who the most active users are in this Twitter dataset.

In [None]:
df.username.value_counts()

This says that __yuhanito__ is the most frequent value in username. In other words, they tweeted the most (in the first 10000 rows of the dataset).
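To make the output of value_counts() concrete, here is a toy example with made-up usernames. It also shows the `normalize=True` option, which returns shares instead of raw counts.

```python
import pandas as pd

# Invented usernames, one entry per tweet.
usernames = pd.Series(["ann", "bob", "ann", "cy", "ann", "bob"], name="username")

counts = usernames.value_counts()
print(counts.index[0], counts.iloc[0])  # ann 3: the most frequent value comes first
print(counts.tolist())                  # [3, 2, 1]

shares = usernames.value_counts(normalize=True)
print(shares["ann"])                    # 0.5: ann wrote half of the tweets
```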

# Analysis
The data we loaded consists of only the first 10000 rows, and the data is sorted by date, so we only loaded tweets posted in March. That is not a reliable basis for a temporal analysis. 
Thus, we will load the full data this time, but sample later for fast analysis.

We also load the first 10000 rows into a variable named `small_df` for comparison purposes, as you will see soon.

In [None]:
df = pd.read_csv('data/twitter_data_chatgpt.csv.bz2')
small_df = pd.read_csv('data/twitter_data_chatgpt.csv.bz2', nrows = 10000) # for comparison purposes

## Data Preprocessing
When dealing with data, you will invest a great deal of time and effort in preprocessing: cleaning the data, creating features, and making the data ready for whatever analysis you will do. It is a fundamental part of data engineering.
AI tools have made this part easier, but it is still important to learn these skills so that you know what you are doing.

### Cleaning
The data provided on Kaggle is in fact dirty. Normally, string values in a CSV should be wrapped in double quotes so that newlines (\n) in the text are recognized as part of the value instead of starting a new row. The author of the dataset broke this principle in some rows. This resulted in columns getting mixed up, e.g., the username ended up in date, the date in id, and so on. We will now clean up this mess.

In [None]:
df

Because the columns are messed up, their dtypes are not correct. 

In [None]:
df.dtypes

Compare with the dtypes in the dataframe of the first 10000 rows (which we named `small_df`).

In [None]:
small_df.dtypes

Because of the unexpected line breaks in the `content` column, some rows were split in two or more. As a result, the remaining columns of those rows contain null values. 
We can identify these rows using isnull(). 
isnull() creates a dataframe of booleans that is True wherever a value is missing.


In [None]:
df.isnull()

We want the rows where **any** value is null. To get them, we call any() on the boolean dataframe, which collapses it into a single boolean per row: True if at least one column in that row is True. Because we aggregate across the columns of each row, we use axis = 1.

In [None]:
df.isnull().any(axis=1) # now we get a single bool for each row
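The role of `axis` is easier to see on a tiny dataframe. The sketch below uses made-up values with two missing cells and compares `axis=1` (one answer per row) with `axis=0` (one answer per column).

```python
import numpy as np
import pandas as pd

# Invented data: row 1 is missing its id, row 2 is missing its content.
toy = pd.DataFrame({
    "id": [1.0, np.nan, 3.0],
    "content": ["ok", "broken", None],
})

null_cells = toy.isnull()               # one boolean per cell

row_has_null = null_cells.any(axis=1)   # one boolean per row
col_has_null = null_cells.any(axis=0)   # one boolean per column

print(row_has_null.tolist())  # [False, True, True]
print(col_has_null.tolist())  # [True, True]
```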

Indexing a dataframe with a series of booleans returns the rows where the boolean is True.
In this case, the series of booleans is a *mask*, which is used to *filter data*. We will go over this concept thoroughly later.

In [None]:
df[df.isnull().any(axis=1)]

Now we see that:
- Line 34984 is broken into three and carries over to 34985 and 34986.
- Line 56153 is broken into three and carries over to 56154 and 56155.
- Line 114179 is broken into two and carries over to 114180.

... and so on.

What to do in such a situation? In fact, Twitter and nearly all other major social media platforms, such as Reddit, YouTube, and TikTok, provide data in JSON format, which is robust to this problem. If you have the original (raw) data, the best solution is to reread it and create a clean CSV. Since the author of the dataset did not provide such data, we have to move on.

The other solution is to read the csv line by line and fix the bad lines, create a clean csv and then load it.

The easiest solution is to drop the erroneous lines and fix the columns' dtypes. This is not a good solution if there are many affected rows. However, in our case, it is only 20-30 tweets out of 500,000, so we will go with this solution. It also provides a good exercise for other cleaning steps.

#### Dropping Rows with Null Values
Drop the rows with null values using dropna()

In [None]:
number_of_rows_earlier = len(df)
df.dropna(how = 'any', inplace = True)
rows_dropped = number_of_rows_earlier - len(df)
print(f'Number of rows dropped: {rows_dropped}')

Now we fix the columns' dtypes.
We use astype() to do this. Note that astype() has no `inplace` option, so we reassign each column.

In [None]:
df['id'] = df['id'].astype(int)
df['like_count'] = df['like_count'].astype(int)
df['retweet_count'] = df['retweet_count'].astype(int)
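As a small illustration of what astype() does, and of why the broken rows had to be dropped first, the sketch below converts a toy text column to integers and then shows that a value that is not a number makes the conversion fail. The values are made up.

```python
import pandas as pd

# Made-up column: numbers that were read as text.
toy = pd.DataFrame({"like_count": ["3", "0", "12"]})

toy["like_count"] = toy["like_count"].astype(int)
print(toy["like_count"].sum())  # 15, now that the values are integers

# A value that cannot be parsed as an integer raises a ValueError.
try:
    pd.Series(["3", "not a number"]).astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)  # True
```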

Now that we have fixed the issue with the id column, we can make it the index again.

In [None]:
df.set_index("id", inplace=True)

#### Renaming columns
An important preprocessing step is to give your columns clear names that are compatible with any additional data you may combine later. 
The Twitter API names the tweet creation date "created_at" and the tweet text "text". The Kaggle author's names are less fortunate: "date" may be confused with the date type, and "content" is not used elsewhere. So we will fix those.

Use rename() to rename columns. Use a dictionary where the keys will refer to the current names and values refer to the new names.

In [None]:
df.rename(columns = {'date':'created_at', 'content': 'text'}, inplace = True)

In [None]:
df

#### Concatenating Dataframes
In a scenario where we need to combine multiple dataframes into one, we can do so by using concat().
We do not have such a scenario in hand right now but we can append the small dataframe to our dataframe to create duplicates for our next exercise.

We will assign the combined dataframe to a dataframe named combined_df. You will shortly see why.

In [None]:
print(f'Length of the dataframe: {len(df)}')
print(f'Length of the small dataframe: {len(small_df)}')
print(f'Length of the combined dataframe is supposed to be: {len(df) + len(small_df)}')
combined_df = pd.concat([df, small_df]) # write to a new dataframe so we preserve the original df
print(f'New length of the dataframe: {len(combined_df)}')

Now look at the head and the tail of the combined dataframe. The head will show the rows from the first dataframe, and the tail will show the rows from the second, smaller one. 

What do you see?

In [None]:
combined_df.head()

In [None]:
combined_df.tail()

That looks awful, doesn't it? 

What do you notice? Answer before you continue reading. 

![This is just so you do not see what's written below](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Muhabbet_kuşu_açık_mavi.jpg/1024px-Muhabbet_kuşu_açık_mavi.jpg)

First, small_df did not have id as its index; id was still a column. Because of that, the combined dataframe has a new id column, which is null for the first part of the data, as that part uses id as the index.

Secondly, because small_df did not have an explicit index, pandas assigned it a RangeIndex that goes from 0 to 9999. Hence the index in the tail of the combined dataframe goes from 9995 to 9999.

Thirdly, we renamed content to text and date to created_at in df but did not make the same changes to small_df, so the combined dataframe now has both the old and the new column names, each with null values for the rows that came from the other dataframe. 

The moral of the story: always make sure the dataframes you are going to concatenate have the same structure.
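The same effect can be reproduced on two tiny dataframes. The names `big` and `small` and their contents are hypothetical stand-ins for `df` and `small_df`.

```python
import pandas as pd

# big: id is the index and the column is already renamed to "text".
big = pd.DataFrame({"text": ["a", "b"]}, index=pd.Index([101, 102], name="id"))

# small: id is still a column and the old column name "content" is used.
small = pd.DataFrame({"id": [101], "content": ["a"]})

mismatched = pd.concat([big, small])
print(sorted(mismatched.columns))            # ['content', 'id', 'text']
print(int(mismatched.isnull().sum().sum()))  # 5 null cells from the misalignment

# Give small the same structure as big, then concatenate again.
small_fixed = small.set_index("id").rename(columns={"content": "text"})
aligned = pd.concat([big, small_fixed])
print(list(aligned.columns))                 # ['text']
print(int(aligned.isnull().sum().sum()))     # 0
```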

Let's apply the same changes to small_df that we applied to df, and then try again.

In [None]:
small_df.set_index('id', inplace=True)
small_df.rename(columns = {'date':'created_at', 'content': 'text'}, inplace = True)
combined_df = pd.concat([df, small_df])

In [None]:
combined_df.head()

In [None]:
combined_df.tail()

### Identifying and Dropping Duplicates
Duplicates may skew your results, although in some cases they carry meaningful information. 
You should consider which types of duplicates are problematic for your analysis, then define and drop duplicates accordingly. 

We will consider multiple cases of duplicates and learn how to deal with them.

#### 1) Duplicate Indexes
In Pandas, an index does not have to be unique, but a unique index is strongly recommended for clarity, performance and correctness.

On Twitter, ids serve as an index: they are unique and are used as the address of the tweet, e.g., "1895669466786402519" in [https://x.com/TheMisterFrog/status/1895669466786402519](https://x.com/TheMisterFrog/status/1895669466786402519) is the tweet id.

If your Twitter dataset has rows with the same ids (which we just set as index), either they are the same exact tweets or tweet ids are stored or read incorrectly.

We do not have this problem in our dataset. However combined_df has it since we appended a subset of the data, creating many duplicates.

In [None]:
df.index.is_unique

In [None]:
combined_df.index.is_unique

To drop rows with duplicated indexes, we first identify the rows where the index is duplicated using `df.index.duplicated(keep='first')`. This creates a boolean mask that is True for rows with a duplicated index **except for their first instance**.
Then we keep only the rows where the index is NOT duplicated (using ~) and assign the result back to the same variable. Again, we will come to masking in a bit.

In [None]:
combined_df = combined_df[~combined_df.index.duplicated(keep='first')]
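On a toy dataframe with one repeated id (made-up values), the mask and its negation look like this:

```python
import pandas as pd

toy = pd.DataFrame(
    {"text": ["a", "b", "a again"]},
    index=pd.Index([101, 102, 101], name="id"),
)

is_repeat = toy.index.duplicated(keep="first")
print(is_repeat.tolist())             # [False, False, True]: only the second 101 is flagged

deduplicated = toy[~is_repeat]        # ~ flips the mask, keeping the non-repeats
print(deduplicated["text"].tolist())  # ['a', 'b']
print(deduplicated.index.is_unique)   # True
```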

Now, the combined_df's index will be unique too

In [None]:
combined_df.index.is_unique

And combined_df just became equal to df

In [None]:
combined_df.equals(df)

#### 2) Duplicate Rows
In some cases, the index (e.g., id) may be unique but the values are the same. Interestingly, this is the case in our dataset:

In [None]:
df[df.duplicated()]

Seems like a clumsily programmed Twitter bot got a fatal error and tweeted "@gpt_chatgpt Response exceeds tweet character limit" 11 times (plus 2 tweets from random people). 
This is not a big deal for a dataset of 500,000 tweets. But let's drop them anyway.
This one is easy, just call `drop_duplicates()`.

In [None]:
df.drop_duplicates(inplace = True)

You may also want to treat rows in which the same user tweeted the same text as duplicates, even if the tweets were not posted at the same time.
Use `subset` to specify which columns should be compared when looking for duplicates. 

In [None]:
df[df.duplicated(subset = ['username', 'text'])] # same user tweets the same text

It appears that there are 3588 such duplicates. Again, not a number big enough to skew the results, but let's drop these too.
We again use `drop_duplicates`, but this time with a subset.
By default, `drop_duplicates` keeps the first instance of a duplicated row. You can opt for keeping the last instance with `keep = 'last'` or drop all instances with `keep = False`. We will continue keeping the first instance, so there is no need to pass anything.

In [None]:
df.drop_duplicates(subset = ['username', 'text'], inplace = True)
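The three `keep` options can be compared side by side on a toy dataframe. The data is invented; `like_count` is only there to show which copy survives.

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["ann", "ann", "bob", "ann"],
    "text": ["hi", "hi", "hi", "bye"],
    "like_count": [1, 2, 3, 4],
})
cols = ["username", "text"]  # rows 0 and 1 are duplicates on these columns

keep_first = toy.drop_duplicates(subset=cols)               # default: keep="first"
keep_last = toy.drop_duplicates(subset=cols, keep="last")
drop_all = toy.drop_duplicates(subset=cols, keep=False)

print(keep_first["like_count"].tolist())  # [1, 3, 4]
print(keep_last["like_count"].tolist())   # [2, 3, 4]
print(drop_all["like_count"].tolist())    # [3, 4]
```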

### Creating and Dropping Columns

You can create new columns using a simple assignment statement.
The following assigns the same value to every row of the new column, i.e., it creates a constant column.

In [None]:
df['year'] = 2023 # every row has the same value

You can also use the preexisting columns to create new columns by simple operations

In [None]:
df['engagement_count'] = df['like_count'] + df['retweet_count']
df['like_retweet_difference'] = df['like_count'] - df['retweet_count']
df['like_retweet_ratio'] = df['like_count'] / df['retweet_count']
df

The year column was a bit useless, let's drop it. 
Use drop to drop columns

In [None]:
df.drop(columns = 'year', inplace=True)

.. or drop multiple columns

In [None]:
df.drop(columns = ['engagement_count', 'like_retweet_difference'], inplace = True)

You can also drop rows by using `df.drop(index = ['....'])` and providing the list of index labels (in our case, ids) of the rows you want to drop. We generally use masking to drop rows though, as you will see soon (sorry for the suspense!).

### Handling Missing Values or NaNs (Not a Number)
At the beginning of the notebook, we handled the bad rows by dropping the rows with missing values. In other cases, the data is read correctly and the rows are fine, but some values are missing or could not be computed.

In our case, when we computed like_retweet_ratio, we got a lot of NaNs from dividing zero by zero, as many tweets have zero likes and zero retweets. 


In [None]:
print(f'Number of NaNs in like_retweet_ratio: {len(df[df.like_retweet_ratio.isna()])}')

One obvious solution is to smooth the values before dividing, e.g., add 1 to both the like count and the retweet count and then divide. 
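A minimal sketch of the smoothing idea on made-up counts. It also shows a second pitfall of the raw ratio: a tweet with likes but zero retweets yields infinity rather than NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"like_count": [0, 4, 6], "retweet_count": [0, 0, 2]})

# Raw ratio: 0/0 gives NaN, 4/0 gives inf, 6/2 gives 3.0
raw_ratio = toy["like_count"] / toy["retweet_count"]
print(raw_ratio.tolist())

# Smoothed ratio: add 1 to both counts so the denominator is never zero.
smoothed_ratio = (toy["like_count"] + 1) / (toy["retweet_count"] + 1)
print(smoothed_ratio.round(2).tolist())  # [1.0, 5.0, 2.33]
```

Keep in mind that smoothing changes every value slightly (6/2 = 3.0 becomes 7/3), so whether it is acceptable depends on the analysis.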

However, if you would rather treat tweets with zero likes and zero retweets as a special case, you can instead fill the NaN values with -1 (or leave them as they are).

You can use `df.fillna(-1, inplace = True)` to fill all NaNs in the dataframe with -1s. However, it is a better practice to specify the columns where you will fill the NaNs

To fill the NaNs in a single column, use an assignment statement:

In [None]:
df['like_retweet_ratio'] = df['like_retweet_ratio'].fillna(-1)

There are other strategies for filling missing values, e.g., filling with the column's mean or median. For time-series data, you can use the value in the previous row (forward fill, `df.ffill()`) or the value in the next row (backward fill, `df.bfill()`). Recent pandas versions deprecate the older `df.fillna(method=...)` spelling in favor of these two methods. You can explore these in your free time.
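The strategies can be compared on a short toy series with a gap in the middle (the numbers are made up):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, np.nan, 4.0])

print(values.ffill().tolist())                # [1.0, 1.0, 1.0, 4.0]: carry the previous value forward
print(values.bfill().tolist())                # [1.0, 4.0, 4.0, 4.0]: pull the next value backward
print(values.fillna(values.mean()).tolist())  # [1.0, 2.5, 2.5, 4.0]: mean of the known values
```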

### Creating Complex Columns
You often create new columns for complex analysis or to use as features for machine learning. Use the apply() function for that.

You can define new functions and use them in the apply() function to apply it to a column. The following function identifies and returns a list of hashtags in the tweet.

In [None]:
def extract_hashtags(text):
    hashtags = []
    splitted = text.split(' ') # split the text into words using spaces
    for word in splitted: # loop through each word
        if word.startswith('#') and len(word) > 1: # check if the word starts with # and is longer than 1 character, which means it is a hashtag
            hashtags.append(word) # add the word to the list of hashtags        
    return hashtags # return the list of hashtags

We now apply the extract_hashtags function to the text column using apply() and assign the output to a new column called hashtags.

In [None]:
df['hashtags'] = df['text'].apply(extract_hashtags)

Note that `extract_hashtags` is passed to apply() as an argument, not called: you write `apply(extract_hashtags)`, not `apply(extract_hashtags(text))`. apply() calls the function for you on every value of the 'text' column.

In cases where the custom function is not that complex, you can use a one-liner "lambda" function

In [None]:
df['hashtags'] = df['text'].apply(lambda text: [word for word in text.split(' ') if word.startswith('#') and len(word) > 1]) 

`[word for word in text.split(' ') if word.startswith('#') and len(word) > 1]` is called a list comprehension. It is one of the signature features of Python, so you should be familiar with it or learn it in your free time.
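To check what the hashtag extraction actually returns, here is the same logic applied to three made-up tweets. The `hashtag_count` column is an extra added for this illustration and is not part of the tutorial's dataset.

```python
import pandas as pd


def extract_hashtags(text):
    # Same rule as above: a word that starts with '#' and has more than one character.
    return [word for word in text.split(' ') if word.startswith('#') and len(word) > 1]


toy = pd.DataFrame({"text": [
    "I love #ChatGPT and #AI",
    "no tags here",
    "a lonely # sign",
]})

toy["hashtags"] = toy["text"].apply(extract_hashtags)
toy["hashtag_count"] = toy["hashtags"].apply(len)  # hypothetical extra column

print(toy["hashtags"].tolist())       # [['#ChatGPT', '#AI'], [], []]
print(toy["hashtag_count"].tolist())  # [2, 0, 0]
```

Because the text is split on spaces only, punctuation stays attached to the word (e.g., `#AI,` would be returned with the comma), so this is a rough extraction rather than a complete one.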

#### Custom function with multiple columns
What if you need to use multiple columns as input to the function you provide to apply()? We combine the two approaches we just learned.

First, let's say we want to identify "self-replies", the tweets which contain the username of their author. This is often the case when users author a Twitter flood / thread.

We define a custom function for that:

In [None]:
def identify_self_replies(text, username):
    reply_username = f'@{username}' # replies start with @
    return text.startswith(reply_username) # returns true if the text starts with the @username, indicates a self-reply

We will use a lambda function again. 
However, this time, the input of the lambda function will be the whole row. Thus, we do not call apply on a particular column but on the dataframe directly: `df.apply(...)`,
and we pass axis = 1 to apply() to tell pandas to apply the function **row-wise**.

In [None]:
df['is_self_reply'] = df.apply(lambda row: identify_self_replies(row['text'], row['username']), axis=1)

Again, we can use one-liners instead of custom functions

In [None]:
df['is_self_reply'] = df.apply(lambda row: row['text'].startswith(f'@{row["username"]}'), axis=1)
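A quick check of the row-wise logic on two made-up tweets: one where the author replies to themselves and one where a different user replies.

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["ann", "bob"],
    "text": ["@ann continuing my thread", "@ann nice thread"],
})

# With axis=1, the lambda receives one row at a time and can read several columns.
toy["is_self_reply"] = toy.apply(
    lambda row: row["text"].startswith(f"@{row['username']}"),
    axis=1,
)

print(toy["is_self_reply"].tolist())  # [True, False]
```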

Let's see how many self_replies we got:

In [None]:
df.is_self_reply.value_counts()

Seems like not many, but that's life :) 

This was part one.
Part two will cover:
- Masking to filter rows
- Sorting
- Group by and aggregations
- Merging and joining
- Crosstab
- Covariation and correlation
- Basic visualizations such as scatterplot and histogram
- Writing data
- Handling dates
- Sampling
- Reshaping and pivoting
- Reading JSON files