#### Module 2: Data Cleaning and Preparation
2.1 **Data Cleaning Techniques**
   - Handling Missing Data
   - Handling Duplicates
   - Data Imputation

2.2 **Data Transformation**
   - Data Types Conversion
   - Data Normalization and Scaling
   - Handling Outliers

**2.1.1 Handling Missing Data**
- Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

- The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

- We call this a sentinel value: when present, it indicates a missing (or null) value:

In [4]:
import pandas as pd
import numpy as np
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The `isna` method gives us a Boolean Series with `True` where values are null:

In [5]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool
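
The point that descriptive statistics skip missing data by default can be checked on the same Series. This is a minimal sketch; the variable names `n_missing`, `mean_skipping_nan`, and `mean_with_nan` are illustrative only:

```python
import numpy as np
import pandas as pd

float_data = pd.Series([1.2, -3.5, np.nan, 0])

# Count the missing entries: True counts as 1 when summed.
n_missing = float_data.isna().sum()

# Descriptive statistics skip NaN by default (skipna=True),
# so this is the mean of the three present values.
mean_skipping_nan = float_data.mean()

# With skipna=False, a single NaN propagates to the result.
mean_with_nan = float_data.mean(skipna=False)

print(n_missing, mean_skipping_nan, mean_with_nan)
```

Passing `skipna=False` is a quick way to find out whether a column contains any missing value before trusting a summary statistic.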

In [17]:
import pandas as pd

# Load the dataset with missing values
df = pd.read_csv("Datasets/BL-Flickr-Images-Book.csv")

In [20]:
df.shape

(8287, 15)

In [21]:
# Count the missing values in each column
missing_values = df.isnull().sum()
missing_values.head()

Identifier                 0
Edition Statement       7514
Place of Publication       0
Date of Publication      181
Publisher               4195
dtype: int64
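
Raw counts are easier to judge next to the share of rows they represent. The sketch below assumes a small hypothetical frame (`toy`) standing in for the book dataset; `isna().mean()` gives the fraction of missing rows per column because `True` is averaged as 1:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the CSV.
toy = pd.DataFrame({
    "Identifier": [206, 216, 218, 472],
    "Edition Statement": [np.nan, np.nan, np.nan, "A new edition"],
    "Publisher": ["S. Tinsley & Co.", np.nan, "Bradbury", np.nan],
})

missing_counts = toy.isna().sum()   # number of missing rows per column
missing_share = toy.isna().mean()   # fraction of missing rows per column

# Put both measures side by side, worst column first.
summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```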

In [48]:
cleaned_df = df.dropna()
cleaned_df.head()

| | Identifier | Edition Statement | Place of Publication | Date of Publication | Publisher | Title | Author | Contributors | Corporate Author | Corporate Contributors | Former owner | Engraver | Issuance type | Flickr URL | Shelfmarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [39]:
cleaned_df.shape

(0, 15)
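
A result with zero rows means every row has at least one missing value somewhere, so a plain `dropna()` is too aggressive here. The `how`, `subset`, and `thresh` parameters of `dropna` give finer control. A minimal sketch on a hypothetical toy frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which every row has at least one NaN.
toy = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Date of Publication": ["1879", np.nan, "1869"],
    "Engraver": [np.nan, np.nan, np.nan],
})

# Default behaviour: drop a row if ANY value is missing (leaves nothing here).
drop_any = toy.dropna()

# Only look at one column when deciding which rows to drop.
drop_subset = toy.dropna(subset=["Date of Publication"])

# Drop only the columns that are entirely missing.
drop_all_nan_cols = toy.dropna(axis=1, how="all")

# Keep rows that have at least 2 non-missing values.
keep_mostly_full = toy.dropna(thresh=2)

print(drop_any.shape, drop_subset.shape, drop_all_nan_cols.shape, keep_mostly_full.shape)
```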

In [41]:
cleaned_df = df.dropna(axis=1)
cleaned_df.head()

| | Identifier | Place of Publication | Title | Contributors | Issuance type | Flickr URL | Shelfmarks |
|---|---|---|---|---|---|---|---|
| 0 | 206 | London | Walter Forbes. [A novel.] By A. A | FORBES, Walter. | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12641.b.30. |
| 1 | 216 | London; Virtue & Yorston | All for Greed. [A novel. The dedication signed... | BLAZE DE BURY, Marie Pauline Rose - Baroness | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12626.cc.2. |
| 2 | 218 | London | Love the Avenger. By the author of “All for Gr... | BLAZE DE BURY, Marie Pauline Rose - Baroness | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12625.dd.1. |
| 3 | 472 | London | Welsh Sketches, chiefly ecclesiastical, to the... | Appleyard, Ernest Silvanus. | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 10369.bbb.15. |
| 4 | 480 | London | [The World in which I live, and my place in it... | BROOME, John Henry. | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 9007.d.28. |


In [42]:
cleaned_df.shape

(8287, 7)

In [43]:
cleaned_df = df.fillna(0)  # Replace missing values with 0
cleaned_df.head()

| | Identifier | Edition Statement | Place of Publication | Date of Publication | Publisher | Title | Author | Contributors | Corporate Author | Corporate Contributors | Former owner | Engraver | Issuance type | Flickr URL | Shelfmarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 206 | 0 | London | 1879 [1878] | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | A. A. | FORBES, Walter. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12641.b.30. |
| 1 | 216 | 0 | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12626.cc.2. |
| 2 | 218 | 0 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12625.dd.1. |
| 3 | 472 | 0 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | A., E. S. | Appleyard, Ernest Silvanus. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 10369.bbb.15. |
| 4 | 480 | A new edition, revised, etc. | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | A., E. S. | BROOME, John Henry. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 9007.d.28. |


In [44]:
cleaned_df.shape

(8287, 15)
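
Filling every column with `0` puts a number into text columns such as `Edition Statement`. `fillna` also accepts a dictionary mapping column names to fill values, which allows a sensible default per column. A minimal sketch on a hypothetical toy frame; the `Pages` column and the `"Unknown"` placeholder are assumptions made for illustration and are not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "Pages" is an invented numeric column.
toy = pd.DataFrame({
    "Publisher": ["Virtue & Co.", np.nan, "James Darling"],
    "Engraver": [np.nan, np.nan, np.nan],
    "Pages": [310.0, np.nan, 120.0],
})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    "Publisher": "Unknown",
    "Pages": toy["Pages"].median(),
}
filled = toy.fillna(value=fill_values)

print(filled)
```

Columns that are not listed in the dictionary (here `Engraver`) keep their missing values, which makes it explicit which columns were imputed.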

In [45]:
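# Fill missing dates with the mode; assign the result back instead of using a chained inplace=True.
# cleaned_df was already filled with 0 above, so no NaN remain and the output is unchanged.
cleaned_df['Date of Publication'] = cleaned_df['Date of Publication'].fillna(cleaned_df['Date of Publication'].mode()[0])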
cleaned_df.head()

| | Identifier | Edition Statement | Place of Publication | Date of Publication | Publisher | Title | Author | Contributors | Corporate Author | Corporate Contributors | Former owner | Engraver | Issuance type | Flickr URL | Shelfmarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 206 | 0 | London | 1879 [1878] | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | A. A. | FORBES, Walter. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12641.b.30. |
| 1 | 216 | 0 | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12626.cc.2. |
| 2 | 218 | 0 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 12625.dd.1. |
| 3 | 472 | 0 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | A., E. S. | Appleyard, Ernest Silvanus. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 10369.bbb.15. |
| 4 | 480 | A new edition, revised, etc. | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | A., E. S. | BROOME, John Henry. | 0.0 | 0.0 | 0 | 0.0 | monographic | http://www.flickr.com/photos/britishlibrary/ta... | British Library HMNTS 9007.d.28. |


In [46]:
cleaned_df.shape

(8287, 15)
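
Mode imputation only has a visible effect on a column that still contains missing values. The sketch below assumes a hypothetical one-column frame with two gaps and fills them with the most frequent value, assigning the result back to the column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing publication dates.
toy = pd.DataFrame({
    "Date of Publication": ["1868", np.nan, "1868", "1851", np.nan],
})

# mode() ignores NaN and returns a Series (there can be ties); take the first entry.
most_common = toy["Date of Publication"].mode()[0]

# Assign the filled column back instead of modifying a selection in place.
toy["Date of Publication"] = toy["Date of Publication"].fillna(most_common)

print(toy)
```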

Here are various ways to clean the data using NumPy and Pandas:

### Using Pandas:

1. **Dropping Rows with Missing Values:**
   - Use `dropna()` to remove rows containing any missing values.
   ```python
   cleaned_df = df.dropna()
   ```

2. **Dropping Columns with Missing Values:**
   - Use `dropna()` with the `axis` parameter set to 1 to remove columns containing any missing values.
   ```python
   cleaned_df = df.dropna(axis=1)
   ```

3. **Filling Missing Values:**
   - Use `fillna()` to replace every missing value with a specified constant, such as 0.
   ```python
   cleaned_df = df.fillna(0)  # Replace missing values with 0
   ```

4. **Imputing Missing Values:**
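   - Fill missing values with a statistic computed from the column itself, such as the mean, median, or mode. Assign the result back to the column rather than calling `fillna(..., inplace=True)` on a selection, which newer pandas versions warn about.
   ```python
   mode = cleaned_df['Date of Publication'].mode()[0]
   cleaned_df['Date of Publication'] = cleaned_df['Date of Publication'].fillna(mode)
   ```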

### Using NumPy:

1. **Dropping Rows with Missing Values:**
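   - Convert the numeric columns to a NumPy array, then use a Boolean mask to remove rows with missing values. `np.isnan` raises a `TypeError` on object arrays, so text columns must be excluded first.
   ```python
   import numpy as np
   # Keep only numeric columns; np.isnan does not accept text (object) data
   cleaned_array = df.select_dtypes(include=np.number).to_numpy()
   # Keep the rows in which no value is NaN
   cleaned_array = cleaned_array[~np.isnan(cleaned_array).any(axis=1)]
   ```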

2. **Dropping Columns with Missing Values:**
   - Convert the numeric columns to a NumPy array, then use a Boolean mask to keep only the columns without missing values.
   ```python
   cleaned_array = df.select_dtypes(include=np.number).to_numpy()
   # Keep the columns in which no value is NaN
   cleaned_array = cleaned_array[:, ~np.isnan(cleaned_array).any(axis=0)]
   ```

3. **Filling Missing Values:**
   - Convert the numeric columns to a NumPy array, then replace the missing values with `np.where`, which returns a new array.
   ```python
   numeric_array = df.select_dtypes(include=np.number).to_numpy()
   cleaned_array = np.where(np.isnan(numeric_array), 0, numeric_array)  # Replace NaN with 0
   ```

4. **Imputing Missing Values:**
   - For a text column such as `Date of Publication`, it is simpler to stay in pandas: compute the mode and assign the filled column back.
   ```python
   mode = df['Date of Publication'].mode()[0]
   df['Date of Publication'] = df['Date of Publication'].fillna(mode)
   ```
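
The NumPy techniques in this list can be tried together on a small float array. This is a sketch under the assumption that the data is already numeric; the array `data` is made up for illustration, and column-mean imputation via `np.nanmean` is one possible choice of statistic, not the only one:

```python
import numpy as np

# Hypothetical numeric data with two missing entries.
data = np.array([
    [1.0, 2.0, np.nan],
    [4.0, 5.0, 6.0],
    [7.0, np.nan, 9.0],
    [10.0, 11.0, 12.0],
])

nan_mask = np.isnan(data)  # True wherever a value is missing

# Drop rows / columns that contain at least one NaN.
complete_rows = data[~nan_mask.any(axis=1)]
complete_cols = data[:, ~nan_mask.any(axis=0)]

# Fill with a constant; np.where builds a new array and leaves data unchanged.
zero_filled = np.where(nan_mask, 0.0, data)

# Impute with the per-column mean, computed while ignoring NaN.
col_means = np.nanmean(data, axis=0)
mean_filled = np.where(nan_mask, col_means, data)  # col_means broadcasts across rows

print(complete_rows.shape, complete_cols.shape)
print(mean_filled)
```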

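Choose the method that fits your data and goals. Dropping rows or columns is the simplest option but discards information: on this dataset `dropna()` removes every row, and `dropna(axis=1)` keeps only 7 of the 15 columns. Filling preserves the shape of the data, but a constant such as 0 or the column mode can distort later statistics, so use it with care.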

In [57]:
import pandas as pd
import numpy as np

# Load the dataset with missing values
df = pd.read_csv("Datasets/BL-Flickr-Images-Book.csv")

# Select only numeric columns
numeric_cols = df.select_dtypes(include=np.number)

In [55]:
# Convert the numeric columns to a NumPy array
cleaned_array = df[numeric_cols.columns].to_numpy()
cleaned_array

array([[2.060000e+02,          nan,          nan,          nan],
       [2.160000e+02,          nan,          nan,          nan],
       [2.180000e+02,          nan,          nan,          nan],
       ...,
       [4.159563e+06,          nan,          nan,          nan],
       [4.159587e+06,          nan,          nan,          nan],
       [4.160339e+06,          nan,          nan,          nan]])

In [56]:
# Select the rows that contain at least one NaN (prefix the mask with ~ to keep only complete rows instead)
cleaned_array = cleaned_array[np.isnan(cleaned_array).any(axis=1)]
cleaned_array

array([[2.060000e+02,          nan,          nan,          nan],
       [2.160000e+02,          nan,          nan,          nan],
       [2.180000e+02,          nan,          nan,          nan],
       ...,
       [4.159563e+06,          nan,          nan,          nan],
       [4.159587e+06,          nan,          nan,          nan],
       [4.160339e+06,          nan,          nan,          nan]])
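
The Boolean mask used above and its inverse select complementary sets of rows. A minimal sketch on a hypothetical toy frame (values are illustrative) that mixes text and numeric columns, following the same `select_dtypes` then `to_numpy` steps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mixing text and numeric columns.
toy = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Walter Forbes", "All for Greed", "Love the Avenger"],
    "Engraver": [np.nan, 1.0, np.nan],
})

# Text columns are excluded so that np.isnan can be applied.
numeric_array = toy.select_dtypes(include=np.number).to_numpy()

has_nan = np.isnan(numeric_array).any(axis=1)  # one Boolean per row
rows_with_nan = numeric_array[has_nan]         # rows containing a NaN
rows_without_nan = numeric_array[~has_nan]     # complete rows only

print(has_nan.sum(), "of", len(has_nan), "rows contain NaN")
```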