# Cleaning
***
## Learning Objectives
* Understand the importance of data cleaning
* Learn about the different types of data cleaning

## Introduction
In this lesson, we'll learn about the importance of data cleaning and the different types of data cleaning. We will be using the [Netflix dataset](https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization) for this lesson.

Cleaning up data involves:
* Removing data that is incorrect, irrelevant, duplicated, or incomplete
* Fixing data that is improperly formatted, inaccurate, or missing
* Removing outliers



### Imports

In [None]:
# Import the dataset

import pandas as pd

df = pd.read_csv('netflix1.csv')
df.head()


### Data Types


In [None]:
# Get all the column types
df.dtypes

[What is an object data type in a Pandas Dataframe?](https://g.co/bard/share/e1f01baab64e) 

In pandas, `object` is the catch-all data type. A column gets the `object` dtype when its values are stored as generic Python objects. In practice this usually means strings, but it can also mean a mix of strings, numbers, and other values in the same column.

In the Netflix dataset, columns such as `title` and `director` have the `object` dtype because they contain strings. The `release_year` column has the `int64` dtype because it contains integers.

Object columns are generally less efficient than numeric columns such as `int64` and `float64` because they require more memory to store. However, they are more flexible and can be used to store a wider variety of data.

If you are working with a pandas DataFrame that has a lot of object columns, you may want to consider converting some of them to a more specific data type, such as `int64`, `float64`, or a datetime type. This can improve the performance of your code.

Here are some additional things to know about the object data type in pandas:

* Values in an object column are stored as Python objects.
* An object column can contain mixed types, meaning that the same column can hold strings, numbers, and other objects.
* A date column read from a CSV file, such as `date_added`, stays `object` until you convert it.
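
As a rough illustration of converting object columns, the sketch below builds a small made-up DataFrame (the `sample` name and its values are assumptions, not rows from the Netflix file), forces every column to `object`, and then converts two columns to more specific types.

```python
import pandas as pd

# Hypothetical data; every column is forced to the object dtype on purpose
sample = pd.DataFrame(
    {
        "title": ["Show A", "Show B", "Show C"],
        "release_year": ["2019", "2020", "2021"],  # numbers stored as text
        "date_added": ["2021-09-25", "2021-09-24", "2021-09-22"],
    },
    dtype=object,
)
dtypes_before = sample.dtypes.astype(str).to_dict()

# Convert the columns that have a more specific natural type
sample["release_year"] = sample["release_year"].astype("int64")
sample["date_added"] = pd.to_datetime(sample["date_added"])
dtypes_after = sample.dtypes.astype(str).to_dict()

print(dtypes_before)
print(dtypes_after)

# Numeric columns support arithmetic that text columns do not
year_span = sample["release_year"].max() - sample["release_year"].min()
print(year_span)
```

After the conversion, `release_year` can be used in calculations and `date_added` can be sorted or filtered as a date.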


From the above, we need to convert the objects to strings in the `df`. We also need to convert `date_added` to a datetime object.

```python
df['date_added'] = pd.to_datetime(df['date_added'])
df['release_year'] = df['release_year'].astype(str)
df['rating'] = df['rating'].astype(str)
df['duration'] = df['duration'].astype(str)
df['listed_in'] = df['listed_in'].astype(str)
df['description'] = df['description'].astype(str)

```

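The code above was auto-generated. It's not perfect (for example, it converts the numeric `release_year` column to strings), but it gives us a good starting point. For now, the only conversion we need is `date_added`: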

In [None]:
# show_id         object
# type            object
# title           object
# director        object
# country         object
# date_added      object
# release_year     int64
# rating          object
# duration        object
# listed_in       object

# Convert date_added to datetime
df['date_added'] = pd.to_datetime(df['date_added'])
df.dtypes
df.head()

In [None]:
#  Get all the rows in duration that end in 'min'
mins = df['duration'].str.endswith('min')

mins.head()

# Add a new column duration_min that contains the duration in minutes.
# Rows that do not end in 'min' get NaN in the new column.
df['duration_min'] = df.loc[mins, 'duration'].str.extract(r'(\d+)', expand=False).astype(int)
df 
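
The same idea can be extended to split a duration into an amount and a unit. This is a minimal sketch on made-up values (the `durations` frame and the `amount` and `unit` column names are assumptions for illustration), using named capture groups so that `str.extract` returns labelled columns.

```python
import pandas as pd

# Hypothetical duration values, some in minutes and some in seasons
durations = pd.DataFrame({"duration": ["90 min", "2 Seasons", "125 min", "1 Season"]})

# Named groups become column names in the extracted DataFrame
parts = durations["duration"].str.extract(r"(?P<amount>\d+)\s*(?P<unit>\w+)")
durations["amount"] = parts["amount"].astype(int)
durations["unit"] = parts["unit"]

# Keep minutes only for rows measured in minutes; other rows stay missing
is_minutes = durations["unit"] == "min"
durations["duration_min"] = durations["amount"].where(is_minutes)

print(durations)
```

Keeping the unit in its own column makes it easy to filter or group by it later.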


## Missing data


### Strings

In [None]:

# Check for NA values
df.isna().sum()


We're in luck. It seems that there are no missing values in the dataset. For completeness, let's look at examples of missing data.
[Here](https://g.co/bard/share/3722d7573f0f) is a string example:

In [None]:
df_missing_string = pd.DataFrame({"Name": ["John Doe", "Jane Doe", None, "Peter Smith"], "Age": [30, 25, 40, 20]})
df_missing_string

Notice the missing name, `None`. To count the missing values in each column of a dataframe, we can use `isnull()`, which is an alias of the `isna()` method used above:

In [None]:
df_missing_string.isnull().sum()

In [None]:
# Return all the rows with missing values in the Name column
df_missing_string[df_missing_string['Name'].isnull()]

In [None]:
df_missing_string['Name'].isnull()

In [None]:
df_missing_string.isna().sum()

The `sum()` method returns the number of missing values in each column. We might be interested in getting a dataframe of `True`/`False` values instead, one per cell:

In [None]:
df_missing_string.isnull()

You can use this to replace or otherwise manipulate the missing values in the dataset:

In [None]:
# Replace missing values with "NOT_SPECIFIED"
df_missing_string['Name'] = df_missing_string['Name'].fillna("NOT_SPECIFIED")
df_missing_string

In [None]:
df_missing_string['Name'].fillna("NOT_SPECIFIED")

### Numbers
Let's look at numbers. [Here](https://g.co/bard/share/580c6f647392) is a df that has some missing data.

There are several options for filling in missing numbers in a dataframe in pandas. Some of the most common methods include:
- Forward fill (ffill): propagates the last valid observation forward to fill the missing values
- Backward fill (bfill): propagates the next valid observation backward to fill the missing values
- Mean: fills the missing values with the mean of the non-missing values in the column
- Median: fills the missing values with the median of the non-missing values in the column
- Mode: fills the missing values with the mode of the non-missing values in the column

To fill missing values in a pandas dataframe, you can use the `fillna()` method. The examples below fill the missing value first with a fixed number and then with the mean:



In [None]:
df_missing_numbers = pd.DataFrame({"Name": ["John Doe", "Jane Doe", "Joe Park", "Peter Smith"], "Age": [30,None, 40, 20]})
df_missing_numbers

The same methods are used for identifying the missing values: 

In [None]:
df_missing_numbers.isna()

We can choose to assign a specific value to the missing value: 

In [None]:
df_missing_numbers['Age'] = df_missing_numbers['Age'].fillna(0) 
df_missing_numbers

Or we can perform a more complicated operation, like finding the average of the values in a column:


In [None]:
# Recreate the dataframe
df_missing_numbers = pd.DataFrame({"Name": ["John Doe", "Jane Doe", "Joe Park", "Peter Smith"], "Age": [30,None, 40, 20]})
df_missing_numbers['Age'] = df_missing_numbers['Age'].fillna(df_missing_numbers['Age'].mean()) # Mean, means average
df_missing_numbers
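
The other fill strategies listed earlier follow the same pattern. Here is a minimal sketch on a made-up `ages` Series (the values are assumptions chosen so that each strategy gives a different result):

```python
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([30, None, 40, 20, None, 20], name="Age")

forward_filled = ages.ffill()                    # copy the previous valid value
backward_filled = ages.bfill()                   # copy the next valid value
median_filled = ages.fillna(ages.median())       # median of 30, 40, 20, 20 is 25
mode_filled = ages.fillna(ages.mode().iloc[0])   # most frequent value is 20

print(pd.DataFrame({
    "original": ages,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "median": median_filled,
    "mode": mode_filled,
}))
```

Forward and backward fill only make sense when row order is meaningful (for example, in time series), while mean, median, and mode ignore the order entirely.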

### Removing Rows with Missing Values
Say we have a DataFrame and we want to remove rows that contain missing values. Pandas provides the `dropna()` method, which can be used to drop either columns or rows with missing data.

In [None]:
import pandas as pd

df_rows = pd.DataFrame({"Name": ["John Doe", "Jane Doe", None, "Peter Smith"], "Age": [30, 25, 40, 20], "Height": [170, 180, None, 160]})
print(df_rows)

In [None]:
# If you want to remove all the rows that do not have a height value, you can use the dropna() method.

df_rows.dropna(subset=['Height'], inplace=True)
df_rows.head()

If you do not specify a subset, `dropna()` checks every column and drops any row that has at least one missing value.

In [None]:
df_rows = pd.DataFrame({"Name": ["John Doe", "Jane Doe", None, "Peter Smith"], "Age": [30, 25, None, 20], "Height": [170, 180, None, 160]})
df_rows

In [None]:
df_rows.dropna( inplace=True)
df_rows
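
To make the difference between "any column missing" and "all columns missing" concrete, here is a small sketch using the `how` parameter of `dropna()`. The `df_demo` frame is a made-up variation of the data above in which row 1 is missing only `Age` and row 2 is missing everything.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["John Doe", "Jane Doe", None, "Peter Smith"],
    "Age": [30, None, None, 20],
    "Height": [170, 180, None, 160],
})

# Default (how="any"): drop a row if at least one value is missing
dropped_any = df_demo.dropna()

# how="all": drop a row only if every value is missing
dropped_all = df_demo.dropna(how="all")

print(dropped_any)  # rows 0 and 3 remain
print(dropped_all)  # rows 0, 1 and 3 remain
```

Neither call uses `inplace=True`, so `df_demo` itself is left unchanged and the results can be compared side by side.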

## Duplicate rows

Here is an example of a dataset that contains duplicated rows:


In [None]:
import pandas as pd

# Sample dataset with duplicated rows
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'David', 'Bob'],
    'Age': [25, 30, 25, 22, 28, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Houston', 'Los Angeles']
}

df = pd.DataFrame(data)

print(df)


Notice that rows 0 and 2 are identical, as are rows 1 and 5. The `duplicated()` method marks the second and later copies of a row (here, rows 2 and 5), so we can show them by executing the following code: 

In [None]:
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)
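
By default, `duplicated()` leaves the first copy of each row unmarked. The sketch below, which rebuilds the same sample data under the assumed name `people`, shows how the `keep` and `subset` parameters change what counts as a duplicate.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie", "David", "Bob"],
    "Age": [25, 30, 25, 22, 28, 30],
    "City": ["New York", "Los Angeles", "New York", "Chicago", "Houston", "Los Angeles"],
})

# Default: only the second and later copies are marked
later_copies = people[people.duplicated()]

# keep=False: every row that has a duplicate is marked, including the first copy
all_copies = people[people.duplicated(keep=False)]

# subset: compare only some columns when deciding what is a duplicate
same_city = people[people.duplicated(subset=["City"], keep=False)]

print(later_copies.index.tolist())  # [2, 5]
print(all_copies.index.tolist())    # [0, 1, 2, 5]
print(same_city.index.tolist())     # [0, 1, 2, 5]
```

Viewing all copies together is often useful before deciding which one to keep.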

At this point, we have three options: 
* Keep the first duplicated row (the default behavior of `drop_duplicates`)
* Drop *all* duplicated rows
* Keep the last duplicated row

Let's start with the default behavior, which keeps the first occurrence of each duplicated row and drops the later copies.

In [None]:
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)


Next, drop every row that has a duplicate, keeping none of the copies:


In [None]:
df_drop_all_duplicates = df.drop_duplicates(keep=False)
df_drop_all_duplicates

Keep the last duplicate row and drop the rest

In [None]:
df_last_occurrence = df.drop_duplicates(keep='last')
print(df_last_occurrence)

## Working with strings

Let's look at the Netflix dataset: 


In [None]:
df = pd.read_csv('netflix1.csv')
df.head()


The values in `show_id` have an *s* at the start. If we remove the *s*, we can treat the column as a number and use it in a more meaningful way. pandas has a `replace` method that can do this for us.

In [None]:
# '^s' matches an s only at the start of the value
df["show_id"] = df["show_id"].replace("^s", "", regex=True)
df.head()

Nice. Let's convert the values to a numerical data type.

In [None]:
df["show_id"] = df["show_id"].astype(int)
df.dtypes["show_id"]
df 
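
A plain `"s"` pattern with `regex=True` removes every `s` in a value, which is harmless only when the prefix is the sole `s`. The sketch below uses made-up labels (not values from the Netflix file) to show how anchoring the pattern with `^` limits the replacement to the start of each value.

```python
import pandas as pd

# Hypothetical labels; the last one contains more than one "s"
labels = pd.Series(["s1", "s22", "season3"])

# Unanchored: removes every "s" in each value
every_s_removed = labels.str.replace("s", "", regex=True)

# Anchored with ^: removes an "s" only at the start of each value
leading_s_removed = labels.str.replace(r"^s", "", regex=True)

print(every_s_removed.tolist())    # ['1', '22', 'eaon3']
print(leading_s_removed.tolist())  # ['1', '22', 'eason3']
```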

There are many other ways to clean string data. Another common task is to use a part of the string, commonly referred to as a substring. Say we only want the first 2 letters of a string. Here is an example of how to do that.

In [None]:
df = pd.DataFrame({"Name": ["John Doe", "Jane Doe", "John Smith"], "Age": [30, 25, 40]})
df

In [None]:
# Get the first two letters of each name
df["First Two Letters"] = df["Name"].str[:2]

print(df)


The magic happens in `.str[:2]`. `[:2]` means get the first two characters of a string. You can also get the last two characters, for example, by doing: 

In [None]:
df = pd.DataFrame({"Name": ["John Doe", "Jane Doe", "John Smith"], "Age": [30, 25, 40]})

df["Last Two Letters"] = df["Name"].str[-2:]

print(df)

Here are some links that explain indexing and slicing (the `[:2]` and `[-2:]` in the code above):
* https://g.co/bard/share/a7be9b5da5c4
* https://www.datacamp.com/tutorial/python-list-index
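
Slicing works the same way on a plain Python string as it does through `.str` on a pandas column. A short sketch (the variable names are only for illustration):

```python
import pandas as pd

name = "Kirsten Johnson"

first_two = name[:2]    # from the start up to, but not including, position 2
last_two = name[-2:]    # from two characters before the end to the end
middle = name[2:5]      # positions 2, 3 and 4

print(first_two, last_two, middle)  # Ki on rst

# The same slice applied to every value in a column
names = pd.Series(["John Doe", "Jane Doe"])
from_position_five = names.str[5:]
print(from_position_five.tolist())  # ['Doe', 'Doe']
```

Positions start at 0, and the end position of a slice is never included.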


The indexing method is handy but the length of the values you want to extract might not be consistent. If you want to extract the first names of the directors, you can use the split method. This method splits a string into a list of strings based on a separator. 

In [None]:
df = pd.read_csv('netflix1.csv')
df.head()

In [None]:
# Get the first names of the directors
df['director_name'] = df['director'].str.split(' ').str[0]
df.head()

`split(' ')` breaks the string into pieces wherever it finds a space, and each piece becomes a value in a list. If you have `Kirsten Johnson`, for example, `split(' ')` will create a list with two values: `['Kirsten', 'Johnson']`. 

`str[0]` gets the first value in the list. i.e. `Kirsten`

Super handy! 🎉
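
If you want the rest of the name as well, `str.split` accepts `n` (maximum number of splits) and `expand=True` (return separate columns instead of lists). A minimal sketch, assuming a few sample names: apart from `Kirsten Johnson`, the names and the `first_name` and `rest_of_name` column labels are made up for illustration.

```python
import pandas as pd

directors = pd.Series(["Kirsten Johnson", "Jean Pierre Melville", "Cher"])

# Same approach as above: take the first piece of each split
first_names = directors.str.split(" ").str[0]

# Split once, at the first space only, and spread the pieces into columns
parts = directors.str.split(" ", n=1, expand=True)
parts.columns = ["first_name", "rest_of_name"]

print(first_names.tolist())  # ['Kirsten', 'Jean', 'Cher']
print(parts)
```

A value with no space, such as the single-word name here, gets a missing value in the second column.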



## Advanced Transformations
We're not quite done with strings. Regular expressions are a small language for matching patterns in text. They are available in almost all programming languages and are extremely powerful once you master them. We'll cover the basics here, but if you want to learn more, check out the [Python Regular Expressions Documentation](https://docs.python.org/3/library/re.html).

### Regular Expressions
Imagine you have a dataframe like the one below:

In [None]:
# Sample DataFrame
data = {'Text': ['Alice has 3 apples.', 'Bob likes cats.', 'Charlie has 15 dogs.']}
df = pd.DataFrame(data)

df

How would you extract only the numbers in `Text`? Short answer - Regular Expressions.

In [None]:
# Extracting numbers using regular expression
df['Numbers'] = df['Text'].str.extract(r'(\d+)')

print(df)

`r'(\d+)'` says match one or more digits. The key part is `\d`, which matches any digit from 0 to 9. The `+` says look for one or more of the previous character. So, `\d+` matches one or more digits in a row. The parentheses form a capture group, which tells `str.extract` which part of the match to return. Rows with no match, such as `Bob likes cats.`, get `NaN`.
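
Note that `str.extract` returns text, not numbers, so `'3'` and `'15'` are still strings. A minimal sketch of one way to convert them, using `pd.to_numeric` (the `texts` name is only for illustration):

```python
import pandas as pd

texts = pd.DataFrame({"Text": ["Alice has 3 apples.", "Bob likes cats.", "Charlie has 15 dogs."]})

# expand=False returns a Series instead of a one-column DataFrame
extracted = texts["Text"].str.extract(r"(\d+)", expand=False)

# Convert the extracted text to numbers; missing values stay missing
texts["Numbers"] = pd.to_numeric(extracted)

print(texts)
print(texts["Numbers"].sum())  # 18.0
```

Because one row has no number, the column holds a missing value and pandas stores the result as floating point.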

Here's a reference table of some commonly used regular expression (regex) patterns and their explanations:

| Pattern                | Explanation                                        |
|------------------------|----------------------------------------------------|
| `\d`                   | Matches any digit (0-9).                           |
| `\D`                   | Matches any non-digit character.                   |
| `\w`                   | Matches any word character (alphanumeric + underscore). |
| `\W`                   | Matches any non-word character.                   |
| `\s`                   | Matches any whitespace character (space, tab, newline, etc.). |
| `\S`                   | Matches any non-whitespace character.             |
| `.`                    | Matches any character except a newline.            |
| `^`                    | Matches the start of a string or line.            |
| `$`                    | Matches the end of a string or line.              |
| `[abc]`                | Matches any of the characters within the square brackets. |
| `[^abc]`               | Matches any character except those within the square brackets. |
| `[a-z]`                | Matches a character in the range 'a' to 'z'.     |
| `*`                    | Matches zero or more occurrences of the preceding pattern. |
| `+`                    | Matches one or more occurrences of the preceding pattern. |
| `?`                    | Matches zero or one occurrence of the preceding pattern. |
| `{n}`                  | Matches exactly 'n' occurrences of the preceding pattern. |
| `{n,}`                 | Matches 'n' or more occurrences of the preceding pattern. |
| `{n,m}`                | Matches between 'n' and 'm' occurrences of the preceding pattern. |
| `()`                   | Groups patterns together.                        |
| `\`                    | Escapes a special character or indicates a special sequence. |

These are just a few of the most commonly used regex patterns. Regular expressions can get quite complex, so this table serves as a starting point. If you want to learn more or explore more advanced patterns, you can refer to online regex documentation and tutorials.
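
To see a few of the patterns from the table in action, here is a short sketch using Python's built-in `re` module. The sample sentence is made up for illustration.

```python
import re

text = "Order 66 shipped on 2021-09-25 to Bob_7."

digits = re.findall(r"\d+", text)                     # runs of digits
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()  # exact repeat counts
starts_with_order = bool(re.search(r"^Order", text))  # ^ anchors at the start
ends_with_period = bool(re.search(r"\.$", text))      # \. is a literal period
vowels = re.findall(r"[aeiou]", "regex")              # any character in the set
spellings = re.findall(r"colou?r", "color colour")    # ? makes the u optional

print(digits)      # ['66', '2021', '09', '25', '7']
print(date)        # 2021-09-25
print(starts_with_order, ends_with_period)  # True True
print(vowels)      # ['e', 'e']
print(spellings)   # ['color', 'colour']
```

The same patterns can be passed to pandas string methods such as `str.extract` and `str.contains`.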

Another useful site is [regex101](https://regex101.com/). This site allows you to test your regex patterns against a sample text of your choosing. It also explains each part of the regex pattern and highlights the characters that are matched by each part.

Here is another example of using a regex to extract information from a string:


In [None]:

# Sample DataFrame
data = {'Description': ['Product Code: ABC123', 'Product Code: DEF456', 'Other text']}
df = pd.DataFrame(data)
df 

In [None]:

# Extracting product codes using regular expression
pattern = r'Product Code: (\w+)'
df['Product Code'] = df['Description'].str.extract(pattern)

print(df)


The regex `Product Code: (\w+)` reads as follows:
* Use the string `Product Code: ` to find a match.
* Then, match any word character (letter, number, or underscore) one or more times.
* The parentheses around `\w+` indicate that this is a capture group, which signals to Pandas that we want to extract this part of the regex. Put another way, we only want the part of the regex inside the parentheses.

Regular expressions definitely take a while to get used to, but they can be a powerful tool once you get the hang of them. 

The good news, though, is that you can use Bard or ChatGPT to generate them for you. [Here](https://g.co/bard/share/55cbf48ad765) is the Bard version.

### Mappings
 

`map()` is a powerful Series method that lets you apply a function to every element in a dataframe column. For example, if you wanted to add 1 to every element in a column, you could do:

```python
df['column1'] = df['column1'].map(lambda x: x + 1)
```
Let's start with `lambda x: x + 1`. [Lambdas](https://g.co/bard/share/cfbc9c2a2f4a) are just quick functions that you can write in one line. The above lambda is the same as:

```python
def add_one(x):
    return x + 1
```
The `map()` method takes every element in the column it is being applied to and passes it into the lambda function you wrote. The results are collected into a new Series, which you can assign to a column. In the above example, we are taking every element in `column1`, adding 1 to it, and then storing the result back in `column1`.

Here is a more concrete example of using `map()` to change country names:

```python
df['Country'] = df['Country'].map(lambda x: 'United States' if x == 'US' else x)
```
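
Here is a runnable version of that idea on a small made-up `countries` frame (the data is an assumption for illustration). It also shows `Series.replace` with a dictionary, which gives the same result for a simple one-to-one substitution.

```python
import pandas as pd

countries = pd.DataFrame({"Country": ["US", "Canada", "US", "India"]})

# Option 1: map() with a lambda, as in the snippet above
with_lambda = countries["Country"].map(lambda x: "United States" if x == "US" else x)

# Option 2: replace() with a dictionary of old value -> new value
with_replace = countries["Country"].replace({"US": "United States"})

print(with_lambda.tolist())   # ['United States', 'Canada', 'United States', 'India']
print(with_lambda.tolist() == with_replace.tolist())  # True
```

The lambda is more flexible because it can run any logic, while `replace` is shorter when you only need to swap specific values.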

Here are a few examples of lambdas that ChatGPT gave me: 
```python
# Example 1: Lambda function with one argument
square = lambda x: x ** 2
print(square(4))  # Output: 16

# Example 2: Lambda function with multiple arguments
add = lambda x, y: x + y
print(add(5, 3))  # Output: 8

# Example 3: Lambda function as an argument to another function
numbers = [1, 2, 3, 4, 5]
squared_numbers = list(map(lambda x: x ** 2, numbers))
print(squared_numbers)  # Output: [1, 4, 9, 16, 25]
```



In [98]:
import pandas as pd

df = pd.DataFrame({"Name": ["John Doe", "Jane Doe", "John Smith"], "Age": [30, 25, 40]})
df 

         Name  Age
0    John Doe   30
1    Jane Doe   25
2  John Smith   40


In [103]:
def uppercase(string):
    return string.upper()

df["Name"] = df["Name"].map(uppercase)

print(df)

         Name  Age
0    JOHN DOE   30
1    JANE DOE   25
2  JOHN SMITH   40


`map()` can also take a dictionary as an argument. The dictionary should contain keys that match the values in the Series. `map()` will look up each value in the Series as a key in the dictionary and replace that value with the associated dictionary value. Values with no matching key become `NaN`.

In [107]:

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Gender': ['M', 'F', 'M']}
df_2 = pd.DataFrame(data)

df_2

      Name  Gender
0    Alice       M
1      Bob       F
2  Charlie       M


In [112]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Gender': ['M', 'F', 'M']}
df_2 = pd.DataFrame(data)

# Mapping dictionary for gender codes to gender names
gender_mapping = {'M': 'Male', 'F': 'Female'}

# Using map() to replace gender codes with gender names
df_2['Gender'] = df_2['Gender'].map(gender_mapping)

print(df_2)


      Name  Gender
0    Alice    Male
1      Bob  Female
2  Charlie    Male
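
One thing to watch for with a dictionary: any value that is not a key in the dictionary comes back as `NaN`. The sketch below uses a made-up `codes` Series with an unexpected `'X'` code to show this, along with one possible way to keep the original value instead.

```python
import pandas as pd

# Hypothetical codes; 'X' has no entry in the mapping
codes = pd.Series(["M", "F", "X"])
gender_names = {"M": "Male", "F": "Female"}

# Unmapped values become NaN
mapped = codes.map(gender_names)

# Fall back to the original value wherever the mapping produced NaN
mapped_with_fallback = codes.map(gender_names).fillna(codes)

print(mapped.tolist())
print(mapped_with_fallback.tolist())  # ['Male', 'Female', 'X']
```

Checking `mapped.isna().sum()` after a dictionary mapping is a quick way to spot values the dictionary did not cover.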
