# Chapter 1

- See datatype of columns : `df.dtypes`
- See dataframe information : `df.info()`
- See summary statistics of columns : `df.describe()`
- Remove leading/trailing character : `df['col'].str.strip('$')`
- Drop values : `df.drop(df[df['col'] > 5].index, inplace = True)`
- Set values :
    1. `df.loc[df['col'] > 5, 'col2'] = 5`
    2. `df.at[row_label, 'col2'] = 5` (single cell only; `.at` takes scalar labels, not a boolean mask)
- Convert type : 
    - Other data type : `df['col'].astype('int')`
    - Categorical : `df["col"].astype('category')`
    - Date : `df['date_column'] = pd.to_datetime(df['datetime_column']).dt.date`
- Test : `assert df['col'].dtype == 'int'`
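
The one-liners above can be chained into a small cleaning pass. The following is a minimal sketch, assuming a hypothetical `revenue` column stored as strings with a leading `$` and an illustrative cap of 20; neither comes from the original notes.

```python
import pandas as pd

# Hypothetical data: revenue stored as strings with a currency symbol
df = pd.DataFrame({"revenue": ["$10", "$25", "$7"]})

# Strip the leading/trailing '$', then convert the strings to integers
df["revenue"] = df["revenue"].str.strip("$").astype("int64")

# Cap values above an (assumed) limit of 20 using a boolean mask with .loc
df.loc[df["revenue"] > 20, "revenue"] = 20

# Verify the conversion worked
assert pd.api.types.is_integer_dtype(df["revenue"])
cleaned = df["revenue"].tolist()
```

`pd.api.types.is_integer_dtype` is used for the check here because it accepts any integer width, which makes the assertion less platform-dependent than comparing against `'int'`.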

### Tips to filter unclean data

- Check if integer values are meant to be categorical values
- Check for unwanted characters
- Check if numerical values are stored as strings
- Check for out of range values
- Check for missing values
- Check if dates are in proper format
- Check for range of dates (if there are impossible values)
- Check for duplicates (both complete and partial)
- Check constraints
- Cross-field validation : sanity / validity check using multiple fields
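
Two of the checks above (impossible dates and cross-field validation) might look like the sketch below. The column names, the sample rows, and the fixed `reference_date` are all hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical booking data: 'total' should equal economy + business
df = pd.DataFrame({
    "economy": [100, 80, 50],
    "business": [20, 10, 5],
    "total": [120, 90, 56],
    "booked_on": pd.to_datetime(["2023-05-01", "2031-01-01", "2023-07-15"]),
})

# Impossible dates: anything after an assumed reference date
reference_date = pd.Timestamp("2024-01-01")
impossible_dates = df[df["booked_on"] > reference_date]

# Cross-field validation: the parts must add up to the reported total
consistent = df[["economy", "business"]].sum(axis=1) == df["total"]
inconsistent_rows = df[~consistent]
```

In real data the reference date would typically be today's date (for example `pd.Timestamp.today()`); it is fixed here so the result is reproducible.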

### Finding Duplications

```
# Drop complete duplicates
df.drop_duplicates(inplace = True)

# Column names to check for partial duplicates
column_names = ['A','B','C']
duplicates = df.duplicated(subset = column_names, keep = False)
# See partial duplicate values
df[duplicates]

# Combine result for partial duplicates
summaries = {'D': 'max', 'E': 'mean'}
df = df.groupby(by = column_names).agg(summaries).reset_index()

```
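
As a worked illustration of the snippet above, here is a small sketch on made-up data. The column names (`first`, `last`, `height`, `weight`) and the choice of `max` / `mean` as summaries are assumptions, not part of the original notes.

```python
import pandas as pd

# Hypothetical data: the two Bob rows are complete duplicates,
# the two Ann rows are partial duplicates (same name, different measurements)
df = pd.DataFrame({
    "first": ["Ann", "Ann", "Bob", "Bob"],
    "last": ["Lee", "Lee", "Ray", "Ray"],
    "height": [170, 172, 180, 180],
    "weight": [60.0, 62.0, 80.0, 80.0],
})

# Drop complete duplicates (removes one Bob row)
df = df.drop_duplicates()

# Flag rows that share the identifying columns
keys = ["first", "last"]
partial = df[df.duplicated(subset=keys, keep=False)]

# Collapse partial duplicates into one row per key
merged = df.groupby(by=keys).agg({"height": "max", "weight": "mean"}).reset_index()
```

Which summary function suits each column depends on the data; `max` and `mean` are only placeholders here.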

# Chapter 2

- Filter out inconsistent categorical data by comparing them with known categories using anti-join
- Uppercase : `df['col'].str.upper()`
- Lowercase : `df['col'].str.lower()`
- Remove leading/trailing character : `df['col'].str.strip('$')`
- Replace character : 
    1. `df['col'] = df['col'].apply(lambda x: x.replace('_', '+'))`
    2. `df['col'] = df['col'].str.replace('_', '+')`
    3. `df['col'] = df['col'].str.replace(r'\D+', '', regex=True)` (removes every non-digit; `regex=True` is required in pandas 2.0+)
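
The anti-join mentioned in the first bullet can be sketched with a set difference followed by `isin`. The `blood_type` column and the list of valid categories below are hypothetical sample data.

```python
import pandas as pd

# Hypothetical data with one value ('Z+') that is not a known category
df = pd.DataFrame({"blood_type": ["A+", "B-", "Z+", "O-"]})
categories = pd.DataFrame(
    {"blood_type": ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]}
)

# Values present in the data but absent from the known categories
inconsistent = set(df["blood_type"]).difference(categories["blood_type"])

# Split the rows into inconsistent and clean subsets
is_inconsistent = df["blood_type"].isin(inconsistent)
bad_rows = df[is_inconsistent]
clean = df[~is_inconsistent]
```

Normalizing case and whitespace first (for example with `.str.upper()` and `.str.strip()`) usually reduces the number of false mismatches before this comparison.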


### Constraints

- Type constraints : data type
- Range constraints : range of data
- Uniqueness constraints : unique value per row
- Membership constraints : known members of a group (e.g. month from 1 to 12, day of week from 1 to 7)
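
Each of the four constraint types can be checked with a one-line filter. This is a minimal sketch on hypothetical columns (`id`, `month`, `rating`), assuming ratings are expected to lie between 1 and 5.

```python
import pandas as pd

# Hypothetical data containing one violation of each kind
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "month": [1, 12, 13, 7],
    "rating": [3, 5, 4, 9],
})

# Type constraint: rating should be an integer column
type_ok = pd.api.types.is_integer_dtype(df["rating"])

# Range constraint: rating must be within 1..5 (inclusive)
range_violations = df[~df["rating"].between(1, 5)]

# Uniqueness constraint: id must not repeat
uniqueness_violations = df[df["id"].duplicated(keep=False)]

# Membership constraint: month must be one of 1..12
membership_violations = df[~df["month"].isin(list(range(1, 13)))]
```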

### Creating categories

```
import numpy as np
import pandas as pd
# Way 1 : Quantile cut (groups of roughly equal size; bin edges come from the data)
group_names = ['0-200K', '200K-500K', '500K+']
df['cat'] = pd.qcut(df['range_col'], q = 3, labels = group_names)

# Way 2 : Explicit bin edges (more precise)
ranges = [0,200000,500000,np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
df['cat'] = pd.cut(df['range_col'], bins=ranges, labels=group_names)

# Way 3 : Mapping
mapping = {'MALE':'M', 'Male':'M', 'FEMALE':'F', 'Female':'F'}
df['cat'] = df['string_col'].replace(mapping)
```
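
The difference between `pd.qcut` and `pd.cut` is easiest to see on skewed data. The sketch below uses a hypothetical income series; with `qcut` the labels no longer describe the actual value ranges, because the bin edges are quantiles of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical, skewed incomes: five low values and one very high value
income = pd.Series([10_000, 20_000, 30_000, 40_000, 50_000, 600_000])
labels = ["0-200K", "200K-500K", "500K+"]

# qcut: three groups with (roughly) the same number of rows each
by_quantile = pd.qcut(income, q=3, labels=labels)

# cut: groups defined by fixed edges, so group sizes can differ
by_range = pd.cut(income, bins=[0, 200_000, 500_000, np.inf], labels=labels)

quantile_groups = by_quantile.tolist()
range_groups = by_range.tolist()
```

Here `qcut` puts two rows in every group, while `cut` puts five rows in `0-200K` and one in `500K+`. When the labels are meant to describe real value ranges, `cut` with explicit edges is usually the safer choice.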

### Testing

```
assert df['col'].dtype == 'int'
# '+' is a regex metacharacter and must be escaped
assert not phone['Phone number'].str.contains(r"\+|-").any()
```

# Chapter 3

### Missing data

1. Missing Completely at Random (MCAR):
    - No systematic relationship between a column's missing values and any other values, observed or unobserved.
    - e.g. errors when inputting data
2. Missing at Random (MAR):
    - There is a systematic relationship between a column's missing values and other observed values.
    - e.g. ozone data missing on high-temperature days
3. Missing Not at Random (MNAR):
    - There is a systematic relationship between a column's missing values and unobserved values.
    - e.g. temperature values missing when the temperature is high
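
One rough way to look for the Missing at Random pattern is to compare another observed column between rows where the value is missing and rows where it is present. The sketch below uses made-up `temperature` and `ozone` readings; a visible difference is only a hint that the missingness may depend on temperature, not proof of the mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical air-quality data: ozone is missing on two hot days
df = pd.DataFrame({
    "temperature": [15, 18, 21, 30, 33, 35],
    "ozone": [20.0, 25.0, 22.0, np.nan, np.nan, 60.0],
})

# Label each row by whether ozone is missing
group = df["ozone"].isna().map({True: "ozone missing", False: "ozone observed"})

# Compare mean temperature between the two groups
mean_temp = df.groupby(group)["temperature"].mean()
```

In this sample the mean temperature is 31.5 where ozone is missing and 22.25 where it is observed.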

### Handling Missing Data

```
# Show number of missing data
df.isna().sum()

# Visualize missing data information
import missingno as msno
import matplotlib.pyplot as plt
msno.matrix(df)
plt.show()

# Drop missing data
df_dropped = df.dropna(subset = ['col'])

# Replace/impute missing data with single value
col_mean = df['col'].mean()
df_imputed = df.fillna({'col': col_mean})

# Replace/impute missing data with series
series_imp = df['col1'] * 5
df_imputed = df.fillna({'col2':series_imp})

# Missing values are not always "NaN". They can be blank, "?" or other symbols (rarely)
# Check for values through manual validations first
df["col"].value_counts() # Look out for suspicious values
# Check which columns contain missing values
df.isna().any()
# Determine number of missing values in a column
df['col'].isnull().sum()
# Drop missing values
df.dropna(axis = 0) # Drop entire row for missing value (default)
df.dropna(axis = 1) # Drop entire column for missing value
# Drop missing values for specific column
df.dropna(subset = ["col"], axis = 0)
# Replace missing values (requires `import numpy as np`; new_val is a placeholder for your replacement value)
df["col"].replace(np.nan, new_val)
df.fillna(0)
```
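
Because missing values are not always `NaN`, it can help to convert placeholder symbols first so that `isna()` and `fillna()` see them. The sketch below assumes `"?"` and the empty string are the placeholders in a hypothetical `col` column; real data may use other markers.

```python
import numpy as np
import pandas as pd

# Hypothetical column where "?", "" and None all mean "missing"
df = pd.DataFrame({"col": ["3", "?", "", "5", None]})

# Turn the placeholder symbols into real NaN values
df["col"] = df["col"].replace(["?", ""], np.nan)

# Convert the remaining strings to numbers
df["col"] = pd.to_numeric(df["col"])

# Count missing values, then impute with the column mean
n_missing = df["col"].isna().sum()
df_imputed = df.fillna({"col": df["col"].mean()})
```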

### Date in pandas

```
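# Parse strings into datetimes; unparseable values become NaT
# (infer_datetime_format is deprecated since pandas 2.0 and can be dropped there)
df["date_col"] = pd.to_datetime(df["date_col"], 
                                infer_datetime_format=True,
                                errors='coerce')
# Extract month information
df["date_col"].dt.month
# Extract year information
df["date_col"].dt.year
# Format datetimes as strings (do this last: the result is text and no longer supports .dt)
df["date_col"] = df["date_col"].dt.strftime("%d-%m-%Y")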
```
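
Put together, parsing, extracting and formatting dates might look like the sketch below. The sample strings and the explicit `format="%Y-%m-%d"` are assumptions for the illustration; the formatted output goes into a separate, hypothetical `date_str` column so that `date_col` keeps its datetime type.

```python
import pandas as pd

# Hypothetical data with one value that is not a valid date
df = pd.DataFrame({"date_col": ["2021-03-15", "2022-11-02", "not a date"]})

# Parse with an explicit format; invalid entries become NaT
df["date_col"] = pd.to_datetime(df["date_col"], format="%Y-%m-%d", errors="coerce")

# Extract components (rows with NaT yield NaN)
df["year"] = df["date_col"].dt.year
df["month"] = df["date_col"].dt.month

# Format as day-month-year strings in a separate column
df["date_str"] = df["date_col"].dt.strftime("%d-%m-%Y")
```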