# Preprocessing

## Loading our data from S3

In [None]:
from pyspark.sql import functions as F

In [None]:
filepath = "s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv"

In [None]:
playlog = (spark.read.format('csv')
             .option('header', 'true')
             .option('inferSchema', 'true')
             .load(filepath))
playlog.show(5)

## First analysis
1. Print out our DataFrame's schema

2. Use `.describe(...)` on your DataFrame

| summary | timestamp | user | song |
|---|---|---|---|
| count | 25739537.0 | 25739537.0 | 25739537 |
| mean | 1442700656.1045842 | 12697.352275450798 | 2.532571778181818E8 |
| stddev | 34432848.72371195 | 13094.065905828476 | 8.334645614940468E8 |
| min | -139955897.0 | 0.0 | `---AtpxbkaE` |
| max | 1554321113.0 | 45903.0 | `zzzcFgRMY6c` |


### Missing values check

3. Count the missing values for each column, put the result in a pandas DataFrame, and print it out.
*TIP: you may use a dictionary comprehension to create the base from which to build the DataFrame.*


| | timestamp | user | song |
|---|---|---|---|
| missing values | 0 | 0 | 0 |


### Duplicates check

4. Check whether `playlog` without duplicates has the same number of rows as the original.

It seems we have duplicates, so let's count how many.

5. Figure out a way to count the number of duplicates.

### Other checks
6. Order the dataframe by ascending `timestamp` and show the first 5 rows.

Do you see anything suspicious?

The first timestamp is negative, and it seems to be the only one.  
Let's make sure there are no others like it.

7. Count the number of rows with a negative timestamp.

As expected, there is only one negative timestamp. Since there is only one such row, we can safely `.collect(...)` it.

8. Collect the problematic rows

There is only one problematic value among more than 25 million rows. This negative timestamp is an error, so the real value is effectively missing. We could try to reconstruct it, but that would be tedious. Since it is a single row out of more than 25 million, we will simply remove it.
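To see why the negative value looks like an error, the sketch below converts the `min` and `max` timestamps reported by `.describe(...)` into dates. It assumes the `timestamp` column holds Unix time in seconds, which the data suggests but the source does not state; `to_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are seconds since the Unix epoch (1970-01-01 UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_utc(ts_seconds):
    """Convert a (possibly negative) Unix timestamp in seconds to a UTC datetime."""
    return EPOCH + timedelta(seconds=ts_seconds)

earliest = to_utc(-139955897)   # `min` from the describe output
latest = to_utc(1554321113)     # `max` from the describe output
print(earliest.isoformat())
print(latest.isoformat())
```

Under that assumption, the minimum falls in 1965, decades before a YouTube play log could exist, while the maximum falls in 2019.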

## Removing the row with a negative timestamp

We will use our new knowledge about the data to perform some preprocessing.  

Our pipeline will have 2 steps:
* Remove duplicates (123651 rows)
* Remove the row with a negative timestamp (1 row)

We will call our new DataFrame `playlog_processed` and save it to S3 in parquet format.

9. Build the processed DataFrame:
* Filter out duplicated rows
* Filter out rows with a negative timestamp
* Save the result to a new DataFrame: `playlog_processed`
* Finally, print out the number of rows in this DataFrame

10. Save the processed DataFrame to S3 in parquet format. For this, you may use the `.write.parquet(...)` method.
*You may use this path: 's3://full-stack-bigdata-datasets/Big_Data/playlog_processed_student.parquet'*