# General tasks and directions

- Add your name, today's date, and the assignment title to the designated cell.
- Write your answers in the cells that contain the line `Add your answer here.`
- Write your code in the cells that contain the line `# Add your implementation here.`
- Use the provided autograder tests to check your work.
- Don't change or delete any provided code (including [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) such as `%%capture output`).


Add your answer here.
Name: Ratanak Uddam Chea
Date: 03/14/2023
Task: exercise 3

# Airline tweets

*Using `pandas` to clean data*.

This is an individual assignment; by submitting it, you agree that the work is your own.


In [1]:
import numpy as np
import pandas as pd

total_points = 0
task_points = 5

## Task 1

Read airline sentiment analysis tweets from the provided file *airline_tweets.csv*.

In [2]:
# Add your implementation here.
df = pd.read_csv("airline_tweets.csv")

In [3]:
# This cell is provided for your convenience to view the initial data status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [4]:
assert df.shape == (14640, 15)
total_points += task_points

In [5]:
assert list(df.columns) == ['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
                           'negativereason', 'negativereason_confidence', 'airline',
                           'airline_sentiment_gold', 'name', 'negativereason_gold',
                           'retweet_count', 'text', 'tweet_coord', 'tweet_created',
                           'tweet_location', 'user_timezone']
total_points += task_points

## Task 2

Remove columns where more than one third (`33%`) of the values are missing.
You must do that programmatically, not hard-coding the column names.
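
The idea can be illustrated on a small frame. This is only a sketch: the `toy` frame and its column names are hypothetical and are not part of the assignment data. It uses `isna().mean()` to get the fraction of missing values per column and keeps the columns at or below one third.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the assignment data).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                # 0% missing
    "b": [np.nan, np.nan, 1.0, 2.0],  # 50% missing
    "c": ["x", None, "y", "z"],       # 25% missing
})

# The mean of a boolean mask is the fraction of True values,
# so this is the share of missing values in each column.
missing_fraction = toy.isna().mean()

# Keep only the columns whose missing share does not exceed one third.
kept = toy.loc[:, missing_fraction <= 1 / 3]
print(list(kept.columns))  # ['a', 'c']
```

Note that `1 / 3` and `33%` are slightly different thresholds; they give different results only for a column whose missing share falls between the two.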

In [6]:
# Add your implementation here.
df.drop(columns=df.columns[df.isna().mean() * 100 > 33], inplace=True)

In [7]:
assert df.shape == (14640, 11)
total_points += task_points

In [8]:
assert list(df.columns) == ['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
                           'negativereason_confidence', 'airline', 'name', 'retweet_count',
                            'text', 'tweet_created', 'tweet_location', 'user_timezone']
total_points += task_points

## Task 3

Remove duplicate rows and rows with more than `2` missing values.
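
The `thresh` argument of `dropna` counts required *non-missing* values, so "at most 2 missing" translates to "at least `n_columns - 2` present". A minimal sketch on a hypothetical toy frame (the frame and the name `max_missing` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 columns (not the assignment data).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 1.0, np.nan],
    "b": [2.0, np.nan, 2.0, 5.0],
    "c": [3.0, np.nan, 3.0, 6.0],
    "d": [4.0, 7.0, 4.0, 8.0],
})
# Row 0: complete. Row 1: 3 missing. Row 2: duplicate of row 0. Row 3: 1 missing.

max_missing = 2
deduplicated = toy.drop_duplicates()

# A row survives if it has at least (number of columns - max_missing) non-null values.
cleaned = deduplicated.dropna(thresh=deduplicated.shape[1] - max_missing)
print(cleaned.index.tolist())  # [0, 3]
```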

In [9]:
# Add your implementation here.
df = df.drop_duplicates().dropna(thresh=df.shape[1]-2)

In [10]:
assert df.shape == (13980, 11)
total_points += task_points

In [11]:
assert not (df.isna().sum(axis=1) > 2).any()
total_points += task_points

## Task 4

Impute missing values in the columns as follows:

- negativereason_confidence: the median value of the column
- tweet_location: "Decorah, IA"
- user_timezone: the most common value of the column
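
One way to express per-column rules like these is to pass a dictionary to `DataFrame.fillna`. The sketch below uses a hypothetical toy frame with made-up column names (`score`, `city`, `zone`) standing in for the three columns above; it is an illustration, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the assignment data).
toy = pd.DataFrame({
    "score": [0.2, np.nan, 0.6, 1.0],
    "city": ["Oslo", None, "Rome", None],
    "zone": ["UTC", "UTC", None, "CET"],
})

# Map each column to its fill value: median, a constant, and the most common value.
fill_values = {
    "score": toy["score"].median(),
    "city": "Decorah, IA",
    "zone": toy["zone"].mode().iloc[0],
}

# fillna with a dict fills each listed column with its own value
# and returns a new frame, leaving `toy` unchanged.
imputed = toy.fillna(fill_values)
print(imputed)
```

Because `fillna` returns a new frame here, this form does not rely on in-place modification of a single column.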


In [12]:
# Add your implementation here.
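# Assign the filled column back instead of calling fillna(inplace=True) on df.<column>:
# the chained in-place form is deprecated in recent pandas and has no effect under copy-on-write.
df["negativereason_confidence"] = df["negativereason_confidence"].fillna(df["negativereason_confidence"].median())
df["tweet_location"] = df["tweet_location"].fillna("Decorah, IA")
df["user_timezone"] = df["user_timezone"].fillna(df["user_timezone"].value_counts().idxmax())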

In [13]:
assert abs(df.negativereason_confidence.mean() - 0.645769) < 0.001
total_points += task_points

In [14]:
assert f"{df.tweet_location.mode().iloc[0]} ({df.user_timezone.mode().iloc[0]})" == \
                "Decorah, IA (Eastern Time (US & Canada))"
total_points += task_points

## Task 5

Set the data types of the columns as follows:

| column                       | data type                      |
|------------------------------|--------------------------------|
| tweet_id                     | built-in 64-bit integer        |
| airline_sentiment            | `pandas` categorical           |
| airline_sentiment_confidence | built-in 64-bit floating-point |
| negativereason_confidence    | built-in 64-bit floating-point |
| airline                      | `pandas` string                |
| name                         | `pandas` string                |
| retweet_count                | built-in 64-bit integer        |
| text                         | `pandas` string                |
| tweet_created                | `pandas` datetime (UTC)        |
| tweet_location               | `pandas` string                |
| user_timezone                | `pandas` categorical           |
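
A table like this maps naturally onto a dictionary passed to `DataFrame.astype`, which ties each dtype to a column name rather than to a position. The sketch below is an assumption-laden illustration: the toy frame, its values, and the timestamp format are hypothetical, and `pd.to_datetime(..., utc=True)` is used here as one possible way to obtain a UTC datetime column.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the columns (values are made up).
toy = pd.DataFrame({
    "tweet_id": [1, 2],
    "airline_sentiment": ["negative", "positive"],
    "text": ["late again", "great crew"],
    "tweet_created": ["2015-02-24 11:35:52 -0800", "2015-02-24 11:15:59 -0800"],
})

# Map column names to target dtypes; columns not listed keep their dtype.
typed = toy.astype({
    "tweet_id": "int64",
    "airline_sentiment": "category",
    "text": "string",
})

# Parse the timestamps and convert them to timezone-aware UTC datetimes.
typed["tweet_created"] = pd.to_datetime(typed["tweet_created"], utc=True)
print(typed.dtypes)
```

Keying the dtypes by column name keeps the conversion correct even if the column order changes.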



In [15]:
# Add your implementation here.
# Target dtypes, listed in the same order as df.columns
new_dtypes = ["int64", "category", "float64", "float64",
              "string", "string", "int64", "string",
              "datetime64[ns, UTC]", "string", "category"]

for column, dtype in zip(df.columns, new_dtypes):
    df[column] = df[column].astype(dtype)

In [16]:
assert df.shape == (13980, 11)
total_points += task_points

In [17]:
assert list(df.dtypes) == ["int64", "category", "float64", "float64",
                           "string", "string", "int64", "string",
                           "datetime64[ns, UTC]", "string", "category"]
total_points += task_points

In [18]:
# This cell is provided for your convenience to view the result
# All 11 columns must have 13980 non-null values and proper data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13980 entries, 0 to 14639
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype              
---  ------                        --------------  -----              
 0   tweet_id                      13980 non-null  int64              
 1   airline_sentiment             13980 non-null  category           
 2   airline_sentiment_confidence  13980 non-null  float64            
 3   negativereason_confidence     13980 non-null  float64            
 4   airline                       13980 non-null  string             
 5   name                          13980 non-null  string             
 6   retweet_count                 13980 non-null  int64              
 7   text                          13980 non-null  string             
 8   tweet_created                 13980 non-null  datetime64[ns, UTC]
 9   tweet_location                13980 non-null  string             
 10  user_timezone                 1398

In [19]:
print(f"Total points: {total_points}/50.")

Total points: 50/50.


## Submission Checklist

- [ ] Your name, today's date, and the assignment title in the designated cell.
- [ ] Your answers in the designated cells (if required).
- [ ] Your code runs and produces the expected output.
- [ ] The validity of your code is verified by autograders (if provided).
- [ ] Restart the kernel and run all cells (in the menubar, select *Kernel*, then *Restart Kernel and Run All Cells*).
- [ ] Save the notebook.
- [ ] Submit the assignment.
