In [1]:
import pandas as pd

df = pd.read_csv('../product_view_data.csv')

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   user_id        11 non-null     int64  
 1   product_id     11 non-null     int64  
 2   liked          11 non-null     bool   
 3   view_duration  8 non-null      float64
 4   source         11 non-null     object 
 5   timestamp      11 non-null     object 
dtypes: bool(1), float64(1), int64(2), object(2)
memory usage: 579.0+ bytes


The output of df.info() shows that the view_duration column has only 8 non-null values out of 11 entries, so it contains missing data. We will fix this.
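As a complementary check, missing values can also be counted directly per column. The sketch below uses a small made-up DataFrame (the name toy and its values are assumptions for illustration, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the product view data (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "view_duration": [0.5, np.nan, 1.2, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select only the rows where view_duration is missing
missing_rows = toy[toy["view_duration"].isna()]
print(missing_rows)
```

The same two calls should work on df unchanged, since they only rely on the column name view_duration.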

In [7]:
df.iloc[3:5, :]

,user_id,product_id,liked,view_duration,source,timestamp
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
4,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677


Rows 3 and 4 are identical, so the dataset contains a duplicated row. We will fix this too.
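Instead of spotting duplicates by eye with iloc, they can be located programmatically. This is a minimal sketch on made-up data (toy and its values are hypothetical), using duplicated with keep=False so that every member of a duplicate group is flagged:

```python
import pandas as pd

# Toy stand-in with one repeated row (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": [10, 20, 20, 30],
    "product_id": [100, 200, 200, 300],
})

# keep=False marks every member of a duplicate group, not only the repeats
duplicate_mask = toy.duplicated(keep=False)
duplicate_rows = toy[duplicate_mask]
print(duplicate_rows)

# With the default keep="first", only the later copies are counted
n_repeats = int(toy.duplicated().sum())
print(n_repeats)
```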


In [8]:
df.dtypes


user_id            int64
product_id         int64
liked               bool
view_duration    float64
source            object
timestamp         object
dtype: object

The dtypes reveal another problem: timestamp is stored as a string (object dtype), when it should be a datetime. We will fix this too.
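To illustrate why the string representation is a problem, the sketch below (on a made-up toy column, not the notebook's data) shows that date components cannot be read from a string column through the .dt accessor:

```python
import pandas as pd

# Toy stand-in: timestamps stored as plain strings (made-up values)
toy = pd.DataFrame({"timestamp": ["2020-01-01 10:00:00", "2020-01-02 11:30:00"]})

try:
    toy["timestamp"].dt.day
    dt_accessor_works = True
except AttributeError:
    # String columns do not expose the datetime accessor
    dt_accessor_works = False

print(dt_accessor_works)
```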

## Dealing with missing data
In this context, the mean of the column is a reasonable replacement for the missing values. We assign the filled column back to df so that the original DataFrame is updated, rather than using inplace=True on the column selection, which recent pandas versions warn about and may not apply to the original DataFrame.


In [11]:
mean = df['view_duration'].mean()
df['view_duration'] = df['view_duration'].fillna(mean)
df

,user_id,product_id,liked,view_duration,source,timestamp
0,3987,997021,True,1.048242,web,2017-09-23 00:18:29.056895
1,6300,865003,True,1.688173,web,2017-09-21 02:20:22.022096
2,6451,712951,False,0.93835,mobile,2017-09-07 11:57:50.044683
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
4,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
5,5700,587019,False,0.493194,web,2017-09-07 00:25:07.019097
6,3400,505123,True,0.93835,web,2017-09-07 13:53:21.008403
7,8403,459916,False,0.675041,mobile,2017-09-25 21:54:00.028323
8,8965,943363,False,0.93835,web,2017-09-17 15:12:21.059489
9,9693,787546,True,0.101743,web,2017-09-26 12:34:46.012559
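The mean is sensitive to extreme values, so the median is a common alternative when a column may contain outliers. The following sketch compares both on made-up numbers (toy and its values are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one large outlier (made-up values)
toy = pd.DataFrame({"view_duration": [0.2, 0.4, np.nan, 0.6, 9.0]})

# fillna returns a new Series; the original column is left untouched
mean_filled = toy["view_duration"].fillna(toy["view_duration"].mean())
median_filled = toy["view_duration"].fillna(toy["view_duration"].median())

print(mean_filled.iloc[2])    # pulled upward by the outlier
print(median_filled.iloc[2])  # stays close to the typical values
```

Which statistic is appropriate depends on the data; for a column without outliers the two give similar results.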


## Dealing with duplicated data


In [15]:
# Drop exact duplicate rows, then confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0
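By default, drop_duplicates only removes rows that match in every column. If repeated views should instead be identified by a subset of columns, the subset and keep parameters can be used. A minimal sketch on made-up data (toy and the choice of key columns are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in: same user/product pair viewed twice at different times
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": [10, 10, 20],
    "timestamp": ["2020-01-01", "2020-01-02", "2020-01-03"],
})

# Exact-match deduplication drops nothing here, because the timestamps differ
exact = toy.drop_duplicates()

# Deduplicate on the key columns only, keeping the most recent row per pair
by_pair = toy.drop_duplicates(subset=["user_id", "product_id"], keep="last")

print(len(exact))
print(by_pair)
```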


## Dealing with incorrect datatypes


In [16]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
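After converting, it is worth confirming that the column really has a datetime dtype. The sketch below does this on made-up values (toy is hypothetical) and also shows errors="coerce", which turns unparseable entries into NaT instead of raising:

```python
import pandas as pd

# Toy stand-in: one valid timestamp string and one malformed entry (made-up values)
toy = pd.DataFrame({"timestamp": ["2020-01-17 03:48:20", "not a date"]})

# errors="coerce" replaces values that cannot be parsed with NaT
converted = pd.to_datetime(toy["timestamp"], errors="coerce")

# Confirm the dtype and count the entries that failed to parse
is_datetime = pd.api.types.is_datetime64_any_dtype(converted)
n_unparsed = int(converted.isna().sum())

print(is_datetime, n_unparsed)
print(converted.dt.day.iloc[0])  # date components are now accessible
```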
