**Problem:**

You are given the following dataset:
1. **Audible Data** : https://1drv.ms/u/s!AiqdXCxPTydhoog8ckLN-6Cw55fzIg?e=EWgZ5d

Your task is to:
- Find the problems with the datasets.
- Define the Data Quality Dimensions.
- Try to clean the datasets.

In [2]:
import pandas as pd
import numpy as np
audible = pd.read_csv('audible_uncleaned.csv')

In [8]:
audible.to_excel('audible.xlsx',index=False)

### Data Quality Dimensions

- Completeness -> is data missing?
- Validity -> does the data break a rule? -> e.g. negative height, duplicate patient id
- Accuracy -> data is valid but not accurate -> e.g. a weight of 1 kg
- Consistency -> data is both valid and accurate but written differently -> e.g. New York and NY
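
The four dimensions can be checked with a few pandas calls. This is a minimal sketch on an invented `patients` table (the column names, values and the 2 kg plausibility threshold are all assumptions made for illustration, not part of the Audible data):

```python
import pandas as pd

# Hypothetical patient records, invented only to illustrate the four dimensions.
patients = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "height_cm": [170.0, -165.0, 180.0, None],
    "weight_kg": [70.0, 1.0, 82.0, 64.0],
    "city": ["New York", "NY", "Boston", "New York"],
})

# Completeness: count missing values per column.
missing_counts = patients.isnull().sum()

# Validity: negative heights and repeated ids break basic rules.
negative_heights = int((patients["height_cm"] < 0).sum())
duplicate_ids = int(patients["patient_id"].duplicated().sum())

# Accuracy: values that are legal but implausible (the threshold is an assumption).
implausible_weights = int((patients["weight_kg"] < 2).sum())

# Consistency: the same city written in more than one way.
city_spellings = sorted(patients["city"].unique())
```

Accuracy is usually the hardest of the four to automate, because deciding what counts as implausible needs domain knowledge.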

### Order of severity

Completeness <- Validity <- Accuracy <- Consistency

### Data Cleaning Order

1. Quality -> Completeness
2. Tidiness / Messy data
3. Quality -> Validity
4. Quality -> Accuracy
5. Quality -> Consistency

#### Steps involved in Data cleaning
- Define
- Code
- Test
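
As a rough illustration of the Define / Code / Test cycle, here is a sketch on a hypothetical sample of the `language` column (the sample values and the names `df`, `clean` and `n_languages` are assumptions for the example):

```python
import pandas as pd

# Hypothetical sample of the `language` column.
df = pd.DataFrame({"language": ["English", "english", "Hindi", "spanish"]})

# Define: `language` mixes lower-case and capitalised spellings; keep one form.

# Code: capitalise every entry on a copy of the data.
clean = df.copy()
clean["language"] = clean["language"].str.capitalize()

# Test: capitalising again must change nothing, and the variants must collapse.
assert (clean["language"] == clean["language"].str.capitalize()).all()
n_languages = clean["language"].nunique()
```

Writing the test as an `assert` makes the cell fail loudly if a later change reintroduces the problem.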

`Always make sure to create a copy of your pandas dataframe before you start the cleaning process`

In [4]:
# create a working copy so the raw dataframe stays untouched
audible_df = audible.copy()
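
A small sketch of why the copy matters, using an invented two-row frame (the names `original`, `alias` and `working` are hypothetical): plain assignment only creates a second name for the same object, while `.copy()` gives an independent dataframe.

```python
import pandas as pd

original = pd.DataFrame({"price": ["468.00", "1,000.00"]})

alias = original            # same object under a second name, not a copy
working = original.copy()   # independent copy (deep by default)

# Edit the working copy only.
working.loc[0, "price"] = "0.00"

alias_is_original = alias is original
original_first_price = original.loc[0, "price"]
working_first_price = working.loc[0, "price"]
```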

### Dirty Data in audible table
`name`: has extra information besides the book name `consistency`

`author` and `narrator`: multiple entries are present in a single cell, and some cells contain `-` or `,` `consistency`

`stars`: missing values are recorded as "Not rated yet" `completeness`

`narrator`: has entries named anonymous or Anonymous `completeness`

`price`: some cells contain `,` as in 1,000.00 `validity`



### Messy data in audible table
`author`: unnecessary words are present that do not contribute to any meaning such as 

`author` and `narrator` : has unnecessary information as 'written by' and 'narrated by'
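
One possible way to strip these prefixes is a case-insensitive regular expression. The cell values below are invented, and the exact spelling of the prefixes in the file may differ, so the pattern is deliberately tolerant of missing spaces and an optional colon:

```python
import pandas as pd

# Hypothetical cell values; the real prefix spelling may differ.
sample = pd.DataFrame({
    "author": ["Written by: Jane Doe", "Writtenby:John Roe"],
    "narrator": ["Narrated by: Sam Poe", "Narratedby:Ann Lee"],
})

# Leading "written by" / "narrated by", optional spaces and an optional colon.
prefix = r"^\s*(?:written|narrated)\s*by\s*:?\s*"

for col in ["author", "narrator"]:
    sample[col] = sample[col].str.replace(prefix, "", case=False, regex=True)
```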

`author` and `narrator`: first name and last name are not separated, and different languages are present

`time` and `released date`: should be in datetime format

`stars`: should be renamed to rating and hold only the numeric rating, not a whole sentence; its type should be float

`price`: has object datatype, should be int

`language`: first letter is sometimes lower case and sometimes capital -> should be in one form

`releasedate`, `language`, `stars`, `price`: have incorrect datatypes; they should be datetime64, category, float and int respectively
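
A sketch of these conversions on an invented two-row sample. The sample values and the `%d-%m-%y` date format are assumptions; check the real file before reusing the format string. `stars` is left out here because it needs text extraction first.

```python
import pandas as pd

# Hypothetical rows; the date format in the real file is an assumption.
sample = pd.DataFrame({
    "releasedate": ["04-08-08", "01-05-18"],
    "language": ["English", "Hindi"],
    "price": ["468.00", "1,000.00"],
})

# price: drop the thousands separator, parse as a number, then store as int.
sample["price"] = pd.to_numeric(
    sample["price"].str.replace(",", "", regex=False)
).astype("int32")

# language: few distinct values, so category saves memory.
sample["language"] = sample["language"].astype("category")

# releasedate: parse with an explicit (assumed) day-month-year format.
sample["releasedate"] = pd.to_datetime(sample["releasedate"], format="%d-%m-%y")
```

Casting to `int32` drops any fractional part of the price and fails on missing values, so this only suits a column that is complete and holds whole amounts.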

### Automatic Assessment

- head and tail
- sample
- info
- isnull
- duplicated
- describe
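
The calls listed above, sketched on a hypothetical three-row stand-in for the audible table (the `demo` frame and its values are invented for the example):

```python
import pandas as pd

# Hypothetical three-row stand-in for the audible table.
demo = pd.DataFrame({
    "name": ["Book A", "Book B", "Book B"],
    "stars": ["Not rated yet", "5 out of 5 stars1 rating", "5 out of 5 stars1 rating"],
    "price": ["468.00", "1,000.00", "1,000.00"],
})

first_rows = demo.head(2)                      # first rows
last_rows = demo.tail(2)                       # last rows
random_rows = demo.sample(2, random_state=0)   # random rows, reproducible
demo.info()                                    # dtypes and non-null counts
null_counts = demo.isnull().sum()              # missing values per column
duplicate_rows = int(demo.duplicated().sum())  # fully repeated rows
summary = demo.describe()                      # count/unique/top/freq for text columns
```

Note that `isnull` only finds real missing values; placeholders such as "Not rated yet" still count as present, which is why the `info()` output can report no nulls for a column that is mostly unrated.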

In [6]:
audible_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         87489 non-null  object
 1   author       87489 non-null  object
 2   narrator     87489 non-null  object
 3   time         87489 non-null  object
 4   releasedate  87489 non-null  object
 5   language     87489 non-null  object
 6   stars        87489 non-null  object
 7   price        87489 non-null  object
dtypes: object(8)
memory usage: 5.3+ MB


In [8]:
# inspecting column stars
audible_df['stars'].sample(10)

# values look like 'Not rated yet' or '4.5 out of 5 stars41 ratings'

46387                   Not rated yet
48091                   Not rated yet
12639                   Not rated yet
32516    4.5 out of 5 stars11 ratings
43997                   Not rated yet
23872                   Not rated yet
13885                   Not rated yet
59196    4.5 out of 5 stars22 ratings
74638                   Not rated yet
1736                    Not rated yet
Name: stars, dtype: object

In [10]:
# create a new column named ratings that holds the number of ratings
audible_df['ratings'] = audible_df.stars.str.split('stars').str.get(1).str.split().str.get(0)

In [12]:
# converting ratings to a numeric (float) type; values that cannot be parsed become NaN
audible_df['ratings'] = pd.to_numeric(audible_df.ratings,downcast='float',errors='coerce')

In [16]:
# filling NaN values in ratings with 0 so that the column can be converted to int
audible_df.ratings = audible_df.ratings.fillna(0).astype('int32')

In [18]:
# inspecting column stars again
audible_df['stars'].sample(10)

21925               Not rated yet
3219                Not rated yet
47654               Not rated yet
32554    5 out of 5 stars1 rating
22140               Not rated yet
68494               Not rated yet
80433               Not rated yet
23471               Not rated yet
18025               Not rated yet
20529               Not rated yet
Name: stars, dtype: object

In [20]:
# replacing 'Not rated yet' with '0' (kept as a string so .str methods apply to every row),
# then extracting the star rating and storing it back in the stars column
audible_df.stars = audible_df.stars.replace('Not rated yet', '0').str.split('out').str.get(0)

In [22]:
#converting its datatype to float32 to save memory
audible_df.stars = audible_df.stars.astype('float32').fillna(0)
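
As an alternative to the chained `split` calls, both numbers can be pulled out in one pass with `str.extract`. This is only a sketch on three sample strings of the shape seen above; allowing `,` inside the ratings count is an assumption about how large counts might be written.

```python
import pandas as pd

raw = pd.Series([
    "Not rated yet",
    "4.5 out of 5 stars41 ratings",
    "5 out of 5 stars1 rating",
])

# Named groups become column names; rows that do not match give NaN.
pattern = r"^(?P<stars>[\d.]+) out of 5 stars(?P<ratings>[\d,]+) ratings?$"
parsed = raw.str.extract(pattern)

# Unrated books get 0 in both columns, matching the cells above.
stars = pd.to_numeric(parsed["stars"], errors="coerce").fillna(0).astype("float32")
ratings = (
    pd.to_numeric(parsed["ratings"].str.replace(",", "", regex=False), errors="coerce")
    .fillna(0)
    .astype("int32")
)
```

Using 0 for unrated books keeps the columns numeric but makes "no rating" indistinguishable from a real score of 0, so leaving those rows as NaN may be preferable when computing averages.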

In [26]:
audible_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         87489 non-null  object 
 1   author       87489 non-null  object 
 2   narrator     87489 non-null  object 
 3   time         87489 non-null  object 
 4   releasedate  87489 non-null  object 
 5   language     87489 non-null  object 
 6   stars        87489 non-null  float32
 7   price        87489 non-null  object 
 8   ratings      87489 non-null  int32  
dtypes: float32(1), int32(1), object(7)
memory usage: 5.3+ MB
