# Introduction
<div class="alert alert-info">
This notebook presents the data cleanup process. The raw data is located in the raw_data folder.
</div>

## Importing Python libraries

In [1]:
import scipy
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


warnings.filterwarnings("ignore")

<div class="alert alert-info"></div>

## Read data
<div class="alert alert-info">
The data is read with the pandas read_json, read_excel and read_csv functions. For the ranking data, the column names
("BrukerID","FilmID","Rangering","Tidstempel") are added explicitly. These column names are given in the README file.
</div>

In [24]:
users_raw_data = pd.read_json("raw_data/bruker.json", orient="split")
films_raw_data = pd.read_excel("raw_data/film.xlsx")
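# A multi-character separator such as "::" is only supported by the Python parsing engine
rankings_raw_data = pd.read_csv("raw_data/rangering.dat", sep="::", engine="python", names=["BrukerID","FilmID","Rangering","Tidstempel"])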
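
As a minimal sketch of how a "::"-delimited file is parsed, the snippet below reads a small in-memory sample instead of the real rangering.dat. The two sample rows and the name `demo_rankings` are hypothetical and only illustrate the layout; the column names are the ones given above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the "::"-delimited layout (illustrative values only).
sample = io.StringIO("0::616::4::959441640\n0::1561::7::959441640\n")

# A separator longer than one character is interpreted as a regular expression,
# which requires the Python parsing engine rather than the default C engine.
demo_rankings = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["BrukerID", "FilmID", "Rangering", "Tidstempel"],
)

print(demo_rankings)
```

Because the file has no header row, passing `names` makes pandas treat every line as data and label the columns itself.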

In [25]:
users_raw_data

Unnamed: 0,BrukerID,Kjonn,Alder,Jobb,Postkode
0,0,,45.0,6.0,92103
1,1,M,50.0,16.0,55405-2546
2,2,M,18.0,20.0,44089
3,3,M,,1.0,33304
4,4,M,35.0,6.0,48105
...,...,...,...,...,...
6035,6036,M,45.0,0.0,61821
6036,6037,F,,,
6037,6038,,25.0,16.0,33301
6038,6039,M,35.0,14.0,92075


In [26]:
films_raw_data

Unnamed: 0,Denne filen inneholder
0,FilmID
1,Tittel
2,Sjanger


In [27]:
rankings_raw_data

Unnamed: 0,BrukerID,FilmID,Rangering,Tidstempel
0,0,616,4,959441640.0
1,0,1561,7,959441640.0
2,0,1540,6,959441640.0
3,0,88,5,959441640.0
4,0,620,8,959441640.0
...,...,...,...,...
900183,6040,1153,4,976584194.0
900184,6040,3714,4,976584260.0
900185,6040,2834,5,976584260.0
900186,6040,48,5,976584300.0


<div class="alert alert-info"></div>

## Inspecting basic statistics of the data
<div class="alert alert-info">
Use the pandas describe method to obtain the count, mean, standard deviation, minimum, maximum, and the first, second (median) and third quartiles of each numerical column.
</div>

In [28]:
users_raw_data.describe()

Unnamed: 0,BrukerID,Alder,Jobb
count,6040.0,5046.0,5447.0
mean,3020.465894,30.666072,9.104278
std,1743.799216,12.954723,11.239708
min,0.0,1.0,0.0
25%,1510.75,25.0,3.0
50%,3020.5,25.0,7.0
75%,4530.25,35.0,14.0
max,6040.0,56.0,99.0


In [29]:
films_raw_data.describe()

Unnamed: 0,Denne filen inneholder
count,3
unique,3
top,Tittel
freq,1


In [30]:
rankings_raw_data.describe()

Unnamed: 0,BrukerID,FilmID,Rangering,Tidstempel
count,900188.0,900188.0,900188.0,898696.0
mean,2991.861171,1989.674352,4.279477,972241400.0
std,1736.206736,1126.366837,1.971074,12146720.0
min,0.0,0.0,1.0,956703900.0
25%,1458.0,1037.0,3.0,965302900.0
50%,2967.0,1959.0,4.0,972990400.0
75%,4501.0,2963.0,5.0,975220200.0
max,6040.0,3952.0,10.0,1046455000.0
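
The count row reported by describe excludes missing values, so comparing it with the number of rows gives a quick estimate of how many values are missing per column. In the output above, for instance, Tidstempel has a count of 898696 against 900188 rows. The following is a small sketch on a hypothetical frame (`demo` and its values are made up for illustration), not on the raw data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values (illustrative only).
demo = pd.DataFrame(
    {
        "Alder": [25.0, 35.0, np.nan, 45.0],
        "Jobb": [6.0, np.nan, np.nan, 1.0],
    }
)

summary = demo.describe()

# describe() counts only non-missing entries, so the gap to the row count
# is the number of missing values in each column.
missing_per_column = len(demo) - summary.loc["count"]

print(missing_per_column)
```

The same numbers can be obtained directly with `demo.isna().sum()`, which is usually the more explicit choice.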


<div class="alert alert-info"></div>

## Data type of each column in the datasets

In [31]:
users_raw_data.dtypes

BrukerID      int64
Kjonn        object
Alder       float64
Jobb        float64
Postkode     object
dtype: object

In [32]:
films_raw_data.dtypes

Denne filen inneholder    object
dtype: object

In [33]:
rankings_raw_data.dtypes

BrukerID        int64
FilmID          int64
Rangering       int64
Tidstempel    float64
dtype: object
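
A plausible reason why columns such as Alder, Jobb and Tidstempel are reported as float64 even though their values look like integers is that pandas stores missing values in numerical columns as NaN, which is a floating-point value. The sketch below illustrates this behaviour on hypothetical series; the names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Without missing values, an integer dtype can be kept.
complete = pd.Series([45, 50, 18], dtype="int64")

# A single NaN forces the default dtype to float64.
with_gap = pd.Series([45, np.nan, 18])

# The nullable "Int64" extension dtype keeps integers and marks the gap as <NA>.
nullable = with_gap.astype("Int64")

print(complete.dtype, with_gap.dtype, nullable.dtype)
```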

<div class="alert alert-info"> </div>

## Dealing with missing data
<div class="alert alert-info">
There are two main schemes for dealing with missing values in a dataset: either the affected rows are dropped altogether, or the missing values are imputed. In the imputation scheme, the missing values
(labeled as 'NaN' or 'None') are replaced by a summary statistic or a given value, depending on whether the target column is categorical or numerical.
</div>
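
The two schemes can be sketched as follows on a small hypothetical frame. The frame `demo_users`, its values, and the choice of the mean for the numerical column and the most frequent value for the categorical column are assumptions made for illustration; they are not necessarily the choices made for the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical user records with one missing value per column (illustrative only).
demo_users = pd.DataFrame(
    {
        "Kjonn": ["M", None, "F", "M"],
        "Alder": [45.0, 50.0, np.nan, 25.0],
    }
)

# Scheme 1: drop every row that contains at least one missing value.
dropped = demo_users.dropna()

# Scheme 2: impute. Here the numerical column gets its mean and the
# categorical column gets its most frequent value (the mode).
imputed = demo_users.copy()
imputed["Alder"] = imputed["Alder"].fillna(imputed["Alder"].mean())
imputed["Kjonn"] = imputed["Kjonn"].fillna(imputed["Kjonn"].mode()[0])

print(dropped.shape)
print(imputed)
```

Dropping keeps only fully observed rows at the cost of losing data, while imputation keeps every row but introduces values that were never observed.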

#### Dealing with missing values in the user data
<div class="alert alert-info">
To find out which scheme was used, let us compare the user data in the sample_data folder with the user data in the raw_data folder.
</div>

In [34]:
users_sample_data = pd.read_csv("sample_data/bruker.csv")
users_sample_data.shape

(200, 5)

In [35]:
users_raw_data.shape

(6040, 5)

<div class="alert alert-info">
The sample data has fewer rows than the raw data. This suggests that the rows with missing values were dropped. Let us verify this hypothesis by dropping the missing values in the raw data.
</div>