In [2]:
import pandas as pd

df = pd.read_csv("product_view_data.csv")
df.head()

Unnamed: 0,user_id,product_id,liked,view_duration,source,timestamp
0,3987,997021,True,1.048242,web,2017-09-23 00:18:29.056895
1,6300,865003,True,1.688173,web,2017-09-21 02:20:22.022096
2,6451,712951,False,,mobile,2017-09-07 11:57:50.044683
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
4,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677


- the main potential data issues that commonly come up in data analysis are:
    - missing data 
    - duplicates 
    - wrong data types 

## missing data 
- how to handle it depends on several factors, such as
    - the reason these values are missing (is it missing at random or not)
    - the type of the data
    - etc
- we can handle them in different ways 
    - imputing them with the mean or any other value
    - drop the rows that contain missing values
    - etc
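
The two options listed above can be sketched side by side. This is a minimal illustration on a hypothetical toy frame (the `toy` dataframe and its values are assumptions, not the notebook's csv), using the median as an example of "any other value":

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not the notebook's csv
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "view_duration": [1.0, np.nan, 3.0, np.nan],
})

# option 1: drop the rows where view_duration is missing
dropped = toy.dropna(subset=["view_duration"])

# option 2: impute the missing values, here with the median instead of the mean
median = toy["view_duration"].median()
imputed = toy.assign(view_duration=toy["view_duration"].fillna(median))

print(dropped)
print(imputed)
```

Dropping keeps only observed values but loses rows; imputing keeps every row but makes the filled values look more certain than they are.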

In [3]:
# check for null values 
df.isnull().sum(axis=0)

user_id          0
product_id       0
liked            0
view_duration    3
source           0
timestamp        0
dtype: int64

In [6]:
# replace the null values with the mean
mean = df["view_duration"].mean()
df["view_duration"] = df["view_duration"].fillna(mean)

In [8]:
df

Unnamed: 0,user_id,product_id,liked,view_duration,source,timestamp
0,3987,997021,True,1.048242,web,2017-09-23 00:18:29.056895
1,6300,865003,True,1.688173,web,2017-09-21 02:20:22.022096
2,6451,712951,False,0.93835,mobile,2017-09-07 11:57:50.044683
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
4,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
5,5700,587019,False,0.493194,web,2017-09-07 00:25:07.019097
6,3400,505123,True,0.93835,web,2017-09-07 13:53:21.008403
7,8403,459916,False,0.675041,mobile,2017-09-25 21:54:00.028323
8,8965,943363,False,0.93835,web,2017-09-17 15:12:21.059489
9,9693,787546,True,0.101743,web,2017-09-26 12:34:46.012559


## duplicates
- there are several reasons why duplicates might occur 
    - combining data from multiple sources
    - human error
- we can use the `duplicated` method to check for duplicates and the `drop_duplicates` method to remove them
    - by default, the `duplicated` method treats the first occurrence as the original and marks the later occurrences as duplicates
    - it considers a row a duplicate only if the values in all the columns match
    - we can change both of these behaviors using parameters that `duplicated()` and `drop_duplicates()` share
        - `keep` controls which occurrence is left unmarked: `"first"` (the default), `"last"`, or `False` to mark every occurrence
        - `subset` restricts the comparison to a subset of the columns
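
A small sketch of how these parameters change the result, on a hypothetical toy frame (the `toy` dataframe is an assumption made up for illustration, loosely modelled on the columns above):

```python
import pandas as pd

# hypothetical toy data: rows 0 and 1 are identical, row 2 differs only in source
toy = pd.DataFrame({
    "user_id": [7782, 7782, 7782, 6300],
    "product_id": [283235, 283235, 283235, 865003],
    "source": ["mobile", "mobile", "web", "web"],
})

# default: compare all columns, leave the first occurrence unmarked
full_dupes = toy.duplicated()

# compare only user_id and product_id
subset_dupes = toy.duplicated(subset=["user_id", "product_id"])

# keep=False marks every occurrence, including the first one
all_copies = toy.duplicated(subset=["user_id", "product_id"], keep=False)

# drop_duplicates accepts the same parameters; here the last occurrence survives
deduped = toy.drop_duplicates(subset=["user_id", "product_id"], keep="last")

print(full_dupes.tolist())
print(subset_dupes.tolist())
print(all_copies.tolist())
print(deduped)
```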

In [9]:
# check for duplicates 
df.duplicated()

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
dtype: bool

In [10]:
# access the duplicated rows 
df[df.duplicated()]

Unnamed: 0,user_id,product_id,liked,view_duration,source,timestamp
4,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677


In [11]:
# count them 
df.duplicated().sum()

1

In [13]:
# to drop duplicates 
df.drop_duplicates(inplace=True)

In [14]:
df

Unnamed: 0,user_id,product_id,liked,view_duration,source,timestamp
0,3987,997021,True,1.048242,web,2017-09-23 00:18:29.056895
1,6300,865003,True,1.688173,web,2017-09-21 02:20:22.022096
2,6451,712951,False,0.93835,mobile,2017-09-07 11:57:50.044683
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
5,5700,587019,False,0.493194,web,2017-09-07 00:25:07.019097
6,3400,505123,True,0.93835,web,2017-09-07 13:53:21.008403
7,8403,459916,False,0.675041,mobile,2017-09-25 21:54:00.028323
8,8965,943363,False,0.93835,web,2017-09-17 15:12:21.059489
9,9693,787546,True,0.101743,web,2017-09-26 12:34:46.012559
10,4107,811855,False,3.112086,web,2017-09-01 10:50:07.042593


## incorrect data types
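- this is also a common problem: columns are often read with a more general type than they should have, for example dates loaded as strings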

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 0 to 10
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   user_id        10 non-null     int64  
 1   product_id     10 non-null     int64  
 2   liked          10 non-null     bool   
 3   view_duration  10 non-null     float64
 4   source         10 non-null     object 
 5   timestamp      10 non-null     object 
dtypes: bool(1), float64(1), int64(2), object(2)
memory usage: 490.0+ bytes


- we can see the data types of the columns using the `dtypes` attribute of the dataframe, or using the `info` method
    - we see above that the `timestamp` column is of type `object` (string) and not `datetime` as it should be
        - datetimes are much more convenient to work with than strings, because we can do operations on them such as extracting the year, month, day, etc., or filter them easily
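
As a rough illustration of why datetimes are convenient, the sketch below converts a few of the timestamp strings shown above and then uses the `.dt` accessor and a comparison. The series `ts` and the cutoff date are assumptions chosen for the example:

```python
import pandas as pd

# a few timestamp strings in the same format as the timestamp column
ts = pd.to_datetime(pd.Series([
    "2017-09-23 00:18:29.056895",
    "2017-09-07 11:57:50.044683",
    "2017-09-01 10:50:07.042593",
]))

# extract components with the .dt accessor
days = ts.dt.day
months = ts.dt.month

# filter by date with an ordinary comparison (the cutoff is an arbitrary example)
cutoff = pd.Timestamp("2017-09-07")
recent = ts[ts >= cutoff]

print(days.tolist(), months.tolist())
print(recent)
```

With plain strings, the same filter would rely on lexicographic ordering, which only works when every value follows exactly the same zero-padded format.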

In [17]:
pd.to_datetime(df["timestamp"])

0    2017-09-23 00:18:29.056895
1    2017-09-21 02:20:22.022096
2    2017-09-07 11:57:50.044683
3    2017-09-17 03:48:20.019677
5    2017-09-07 00:25:07.019097
6    2017-09-07 13:53:21.008403
7    2017-09-25 21:54:00.028323
8    2017-09-17 15:12:21.059489
9    2017-09-26 12:34:46.012559
10   2017-09-01 10:50:07.042593
Name: timestamp, dtype: datetime64[ns]

In [18]:
# assign the converted values back to the original column
df["timestamp"] = pd.to_datetime(df["timestamp"])
df.head()

Unnamed: 0,user_id,product_id,liked,view_duration,source,timestamp
0,3987,997021,True,1.048242,web,2017-09-23 00:18:29.056895
1,6300,865003,True,1.688173,web,2017-09-21 02:20:22.022096
2,6451,712951,False,0.93835,mobile,2017-09-07 11:57:50.044683
3,7782,283235,True,0.194162,mobile,2017-09-17 03:48:20.019677
5,5700,587019,False,0.493194,web,2017-09-07 00:25:07.019097


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 0 to 10
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   user_id        10 non-null     int64         
 1   product_id     10 non-null     int64         
 2   liked          10 non-null     bool          
 3   view_duration  10 non-null     float64       
 4   source         10 non-null     object        
 5   timestamp      10 non-null     datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 490.0+ bytes


- a csv file does not store data types, so when we save the dataframe as a csv and read it again, the `timestamp` column comes back as a string and has to be converted again
- the function `pd.to_datetime` provides more options to handle different date formats
    - we can use the `format` parameter to specify the format of the date
    - we can use the `errors` parameter to specify how to handle errors in the conversion
        - `raise` (the default) will raise an error if any value cannot be converted
        - `coerce` will convert the values that cannot be parsed to `NaT` (not a time), which is the special value for missing data in datetime columns
        - `ignore` will ignore the errors and return the original values unchanged (this option is deprecated since pandas 2.2)
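
A minimal sketch of the `format` and `errors` parameters, followed by one way to avoid the manual reconversion after a csv round trip. The series `raw`, the format string, and the in-memory csv text are assumptions made up for the example; `parse_dates` is a `read_csv` parameter that is not used elsewhere in this notebook:

```python
import io
import pandas as pd

# hypothetical values; the middle entry is deliberately malformed
raw = pd.Series(["2017-09-23 00:18:29", "not a date", "2017-09-01 10:50:07"])
fmt = "%Y-%m-%d %H:%M:%S"

# errors="coerce": entries that do not match the format become NaT
parsed = pd.to_datetime(raw, format=fmt, errors="coerce")

# errors="raise" (the default): the conversion fails with an exception
try:
    pd.to_datetime(raw, format=fmt, errors="raise")
    raised = False
except ValueError:
    raised = True

# csv round trip: parse_dates asks read_csv to convert the column while loading
csv_text = "user_id,timestamp\n3987,2017-09-23 00:18:29.056895\n"
reread = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(parsed)
print(raised)
print(reread.dtypes)
```

After coercing, `parsed.isna().sum()` gives a quick count of how many values failed to parse, which is worth checking before deciding how to treat them.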