# Task to do

**Preprocess the `ratings_small` dataset for the upcoming EDA**: remove duplicates, handle missing values and mistyped columns, remove outliers.

The movies dataset -> https://www.kaggle.com/rounakbanik/the-movies-dataset

# 1. Loading the data

>## First, we import the libraries we need
>- `Path` for file path handling
>- pandas and numpy for dataframe manipulation
>- seaborn and matplotlib for data visualisation

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

>## We organise our file directories

In [2]:
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

>## We load the ratings dataset into a dataframe using pandas

In [3]:
raw_ratings_df = pd.read_csv(RAW_DIR / 'ratings_small.csv')
display(raw_ratings_df.sample(3))

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 82040 | 560 | 780 | 4.0 | 1452848439 |
| 53031 | 384 | 8972 | 4.0 | 1199658240 |
| 92017 | 608 | 2427 | 5.0 | 939362228 |


# 2. Preprocessing of the data 

The data is a subset of roughly 100,000 ratings from about 700 users on about 9,000 movies.<br>
Columns contents:
- **userId:** The id of the user. Their ids have been anonymized. 
- **movieId:** The id of the movie. Movie ids are consistent between `ratings_small.csv`, `movies_metadata.csv`, and `links_small.csv` (i.e., the same id refers to the same movie across these data files).
- **rating:** The rating, on a 5-star scale, that the user gave the movie. Each line of this file after the header row represents one rating of one movie by one user.
- **timestamp:** Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
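
The counts quoted above can be checked directly from the columns. Below is a minimal sketch, assuming the column names listed above; `summarize_ratings` and the toy dataframe are hypothetical and only illustrate the idea (the values are made up, not taken from the dataset).

```python
import pandas as pd


def summarize_ratings(df: pd.DataFrame) -> dict:
    """Count ratings, distinct users and distinct movies in a ratings table."""
    return {
        "n_ratings": len(df),
        "n_users": df["userId"].nunique(),
        "n_movies": df["movieId"].nunique(),
    }


# Toy data with the same columns as ratings_small.csv (values are made up)
sample_df = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [31, 1029, 31],
    "rating": [2.5, 3.0, 4.0],
    "timestamp": [100, 200, 300],
})

summary = summarize_ratings(sample_df)
print(summary)
```

On the real data, the same call would be `summarize_ratings(raw_ratings_df)`.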



## 2.1. Information on the data

In [4]:
raw_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [5]:
# Check whether there are any duplicate rows
raw_ratings_df.duplicated().value_counts()

False    100004
dtype: int64
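
`duplicated()` with no arguments only flags rows that are identical in every column. Since each line is meant to be one rating of one movie by one user, it may also be worth checking for repeated (`userId`, `movieId`) pairs. The sketch below is one possible way to do that; `count_repeated_pairs` and the toy data are hypothetical.

```python
import pandas as pd


def count_repeated_pairs(df: pd.DataFrame, key=("userId", "movieId")) -> int:
    """Count rows whose key columns repeat an earlier row."""
    return int(df.duplicated(subset=list(key), keep="first").sum())


# Toy data (made up): user 1 rates movie 31 twice, with different values
toy_df = pd.DataFrame({
    "userId": [1, 1, 2, 1],
    "movieId": [31, 1029, 31, 31],
    "rating": [2.5, 3.0, 4.0, 3.5],
    "timestamp": [100, 200, 300, 400],
})

full_row_dupes = int(toy_df.duplicated().sum())  # identical rows only
pair_dupes = count_repeated_pairs(toy_df)        # same user and movie
print(full_row_dupes, pair_dupes)
```

A full-row check would miss the repeated pair here because the rating and timestamp differ.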

> The ratings table:
>- has 100004 rows and 4 columns:
    >    - NO missing values
    >    - NO duplicated rows
>- has 1 float column and 3 integer columns, and no object columns:
    >    -  One integer column contains the timestamp, which is the number of seconds between a particular date and January 1, 1970 (UTC). **(we will handle this later by converting it to a datetime)**


## 2.2. Handling missing, incorrect and invalid data

### a. missing values

In [6]:
# We don't have any missing values, but one column type needs correcting (timestamp)
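
The absence of missing values was read from the non-null counts of `info()`. It can also be checked explicitly. This is a minimal sketch; `missing_report` and the toy dataframe are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy data (made up) with one missing rating
toy_df = pd.DataFrame({
    "userId": [1, 2, 3],
    "rating": [4.0, np.nan, 3.5],
})

report = missing_report(toy_df)
print(report)
```

An empty result, as expected for `raw_ratings_df`, means no column has missing values.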

### b. mistyped columns

In [7]:
from datetime import datetime

# Convert the timestamp (seconds since January 1, 1970) to a datetime.
# Note: datetime.fromtimestamp returns local time on the machine running the notebook.
raw_ratings_df['date'] = raw_ratings_df.timestamp.apply(lambda x: datetime.fromtimestamp(x))
# Drop the timestamp column and keep the result as the processed dataframe
process_rating_df = raw_ratings_df.drop('timestamp', axis=1)
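
`datetime.fromtimestamp` converts to the local time zone of the machine, so the resulting dates depend on where the notebook runs. As an alternative (not what this notebook uses), the sketch below converts the whole column at once with `pd.to_datetime` and keeps the result in UTC, matching the column description. `epoch_seconds_to_utc` and the toy values are hypothetical.

```python
import pandas as pd


def epoch_seconds_to_utc(seconds: pd.Series) -> pd.Series:
    """Convert seconds since 1970-01-01 00:00:00 UTC to timezone-aware datetimes."""
    return pd.to_datetime(seconds, unit="s", utc=True)


# Toy timestamps (made up): the epoch, one day later, and 1,000,000,000 seconds
toy_ts = pd.Series([0, 86400, 1_000_000_000])
toy_dates = epoch_seconds_to_utc(toy_ts)
print(toy_dates)
print(toy_dates.dt.year.tolist())
```

Applied to the real data, this would read `epoch_seconds_to_utc(raw_ratings_df['timestamp'])`. The times would then differ from the local-time output of the cell above by the local UTC offset.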


In [8]:
process_rating_df['rating_year'] = process_rating_df.date.dt.year
# process_rating_df['rating_month'] = process_rating_df.date.dt.month
# process_rating_df['rating_day'] = process_rating_df.date.dt.day
# process_rating_df['rating_hour'] = process_rating_df.date.dt.hour
# process_rating_df['rating_minute'] = process_rating_df.date.dt.minute
process_rating_df

| | userId | movieId | rating | date | rating_year |
|---|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 2009-12-14 03:52:24 | 2009 |
| 1 | 1 | 1029 | 3.0 | 2009-12-14 03:52:59 | 2009 |
| 2 | 1 | 1061 | 3.0 | 2009-12-14 03:53:02 | 2009 |
| 3 | 1 | 1129 | 2.0 | 2009-12-14 03:53:05 | 2009 |
| 4 | 1 | 1172 | 4.0 | 2009-12-14 03:53:25 | 2009 |
| ... | ... | ... | ... | ... | ... |
| 99999 | 671 | 6268 | 2.5 | 2003-10-08 04:16:10 | 2003 |
| 100000 | 671 | 6269 | 4.0 | 2003-10-03 04:46:41 | 2003 |
| 100001 | 671 | 6365 | 4.0 | 2003-12-09 04:26:03 | 2003 |
| 100002 | 671 | 6385 | 2.5 | 2003-12-09 15:21:03 | 2003 |
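
The task list mentions outliers. For a bounded rating scale, one simple check is to flag values that fall outside the scale. The sketch below assumes ratings run from 0.5 to 5.0 in half-star steps; that range is an assumption to confirm against the dataset documentation, and `invalid_rating_mask` is a hypothetical helper.

```python
import pandas as pd


def invalid_rating_mask(ratings: pd.Series, low=0.5, high=5.0, step=0.5) -> pd.Series:
    """Flag ratings outside [low, high] or not on the half-star grid (assumed scale)."""
    out_of_range = ~ratings.between(low, high)
    off_grid = (ratings / step) % 1 != 0
    return out_of_range | off_grid


# Toy ratings (made up): three valid values, then 0.0, 6.0 and 3.7
toy_ratings = pd.Series([0.5, 2.5, 5.0, 0.0, 6.0, 3.7])
mask = invalid_rating_mask(toy_ratings)
print(toy_ratings[mask])
```

On the real data, `process_rating_df[invalid_rating_mask(process_rating_df['rating'])]` would list any suspicious rows; an empty result would suggest there is nothing to remove.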


In [9]:
# Save the result: no further changes to this data are planned
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
process_rating_df.to_csv(PROCESSED_DIR / 'ratings_small.csv', index=False)
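
A CSV file stores the `date` column as plain text, so the datetime type is lost when the file is read back. The sketch below shows one way to restore it with the `parse_dates` argument of `pd.read_csv`. It writes a small made-up dataframe to a temporary folder rather than to `PROCESSED_DIR`, so the paths and values here are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy processed data (made up) with a datetime column
toy_df = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [31, 1029],
    "rating": [2.5, 3.0],
    "date": pd.to_datetime(["2009-12-14 03:52:24", "2009-12-14 03:52:59"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "ratings_small.csv"
    toy_df.to_csv(csv_path, index=False)

    # Without parse_dates, the date column comes back as text
    plain = pd.read_csv(csv_path)
    # With parse_dates, it is parsed back into datetimes
    restored = pd.read_csv(csv_path, parse_dates=["date"])

print(plain["date"].dtype, restored["date"].dtype)
```

The EDA notebook could load the processed file the same way, with `pd.read_csv(PROCESSED_DIR / 'ratings_small.csv', parse_dates=['date'])`.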