# Lesson 1: Handling Categorical Data

## Lesson Introduction
Welcome to our lesson on handling categorical data with Pandas! We're diving into a critical aspect of data manipulation. Data comes in various types, and one of the most important is categorical data: data whose values fall into a fixed set of categories.

By the end of this lesson, you'll understand how to convert columns in a DataFrame to categorical types, why it's important, and how to verify the conversion. We'll also see an example of encoding categorical data efficiently. Let's get started!

## Understanding Categorical Data
Categorical data can be divided into groups or categories. It's like sorting toys into different bins: one for cars, one for dolls, and one for blocks. In real-life data, examples include gender (male or female), class (first, second, third), or colors (red, blue, green).

In Pandas, categorical data can make computations faster and save memory. It's like organizing toys so you can find the one you need quickly!

Starting with this lesson, we will occasionally work with real data, not just toy examples. Meet the famous Titanic dataset, which contains information about the Titanic's passengers and whether they survived. It mainly comprises the passengers' demographics and travel details, which can be used to predict survival. For instance, it includes features like the ticket fare, the passenger's class, and the passenger's age.

This dataset has multiple categorical columns. The most straightforward example is the 'sex' column, which contains either "male" or "female".

## Why Convert to Categorical Data
So why convert data to categorical types?

- **Memory Efficiency:** A categorical column stores each distinct value once and represents every row with a small integer code, so it usually takes far less memory than a string column with many repeated values.
- **Performance:** Operations such as comparisons and grouping can be faster on categorical data than on string data because they work on the integer codes.
- **Clarity:** It indicates that a column contains specific categories rather than free text.
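
As a rough illustration of the memory point, the sketch below compares the footprint of a text column before and after conversion. The Series is a made-up stand-in for a column like sex, not the Titanic data itself, and the exact byte counts will vary with your Pandas version.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality text column such as 'sex'
sex_as_object = pd.Series(["male", "female", "female", "male"] * 25_000, dtype=object)
sex_as_category = sex_as_object.astype("category")

# deep=True also counts the memory held by the Python strings themselves
object_bytes = sex_as_object.memory_usage(deep=True)
category_bytes = sex_as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from the repetition: the more often each value repeats, the bigger the gain. A column of mostly unique strings would benefit little.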

Let's see a practical example using the Titanic dataset, which contains passenger details like gender and class. By converting columns like sex and class to categorical types, we can make operations more efficient.

## Identifying Categorical Data
Before converting anything, let's look at the column types in the Titanic dataset. The `.info()` method in Pandas lists every column along with its dtype; in the output below, text columns show up as `object`.

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Before conversion
# info() prints its report directly and returns None, so call it on its own
print("Before Conversion:")
titanic.info()
# Before conversion output (first parts of the data only, for brevity)
# Data columns (total 15 columns):
#  #   Column       Non-Null Count  Dtype   
# ---  ------       --------------  -----   
#  0   survived     891 non-null    int64   
#  1   pclass       891 non-null    int64   
#  2   sex          891 non-null    object  # <- Important parts to observe for this lesson
#  3   age          714 non-null    float64 
#  4   sibsp        891 non-null    int64   
#  5   parch        891 non-null    int64   
#  6   fare         891 non-null    float64 
#  7   embarked     889 non-null    object  
#  8   class        891 non-null    object  # <- Important parts to observe for this lesson
#  9   who          891 non-null    object  
#  ... (Differs by dtypes per each column)
```

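The snippet above loads the Titanic dataset and prints its column information. In this lesson, we will focus only on the sex and class columns, which contain each passenger's sex and ticket class, respectively. Note that some seaborn versions already load class as a categorical column; in that case the conversion in the next section simply leaves it unchanged.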
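
One way to spot columns worth converting is to count the distinct values in each non-numeric column. The sketch below does this on a small invented table (not the Titanic data); the 0.5 cutoff is an arbitrary assumption for illustration, not a Pandas rule.

```python
import pandas as pd

# Hypothetical miniature passenger table, invented for this illustration
passengers = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan", "Eve", "Finn"],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "embarked": ["S", "C", "S", "Q", "S", "C"],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

# Count distinct values in every non-numeric column
unique_counts = passengers.select_dtypes(exclude="number").nunique()

# Assumed rule of thumb: at most half as many distinct values as rows
max_unique_ratio = 0.5
candidates = [
    column
    for column, count in unique_counts.items()
    if count / len(passengers) <= max_unique_ratio
]
print(candidates)
```

Here name is skipped because every value is unique, while sex and embarked repeat often enough to qualify.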

## How to Convert Columns to Categorical Types
Now let's convert the sex and class columns and reprint the DataFrame information.

```python
# Convert 'sex' and 'class' columns to categorical types
titanic['sex'] = titanic['sex'].astype('category')
titanic['class'] = titanic['class'].astype('category')

# After conversion
# info() prints its report directly and returns None, so call it on its own
print("After Conversion:")
titanic.info()
# After conversion output (first parts of the data only, for brevity)
# Data columns (total 15 columns):
#  #   Column       Non-Null Count  Dtype   
# ---  ------       --------------  -----   
#  0   survived     891 non-null    int64   
#  1   pclass       891 non-null    int64   
#  2   sex          891 non-null    category # <- Changed type
#  3   age          714 non-null    float64 
#  4   sibsp        891 non-null    int64   
#  5   parch        891 non-null    int64   
#  6   fare         891 non-null    float64 
#  7   embarked     889 non-null    object  
#  8   class        891 non-null    category # <- Changed type
#  9   who          891 non-null    object  
#  ... (Differs by dtypes per each column)
```

Notice how sex and class changed from object to category. This confirms the conversion was successful. This way, Pandas now treats these columns as categorical data, optimizing memory and performance.
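
Besides reading the `.info()` report, you can check a single column programmatically. The following is a minimal sketch on an invented Series standing in for the class column; it inspects the dtype and lists the categories Pandas found.

```python
import pandas as pd

# Hypothetical stand-in for the 'class' column
passenger_class = pd.Series(["Third", "First", "Third", "Second"]).astype("category")

# The dtype of a converted column is a CategoricalDtype
is_categorical = isinstance(passenger_class.dtype, pd.CategoricalDtype)

# The distinct categories, sorted alphabetically by default
categories = list(passenger_class.cat.categories)

print(is_categorical)
print(categories)
```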

## Encoding Examples: Label Encoding
Sometimes, you must convert categorical data to numeric codes for machine learning models. Let's see how to encode the sex column with label encoding. It is the simplest encoding: each category is replaced with an integer. Here, female becomes 0 and male becomes 1.

```python
# Label encoding the 'sex' column
titanic['sex_code'] = titanic['sex'].cat.codes
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
# 0         0       3    male  22.0      1      0   7.2500        S   Third   
# 1         1       1  female  38.0      1      0  71.2833        C   First   
# 2         1       3  female  26.0      0      0   7.9250        S   Third   
# 3         1       1  female  35.0      1      0  53.1000        S   First   
# 4         0       3    male  35.0      0      0   8.0500        S   Third   
#      who  adult_male deck  embark_town alive  alone  sex_code
# 0    man        True  NaN  Southampton    no  False         1
# 1  woman       False    C    Cherbourg   yes  False         0
# 2  woman       False  NaN  Southampton   yes   True         0
# 3  woman       False    C  Southampton   yes  False         0
# 4    man        True  NaN  Southampton    no   True         1
```

`cat.codes` is an attribute of the `.cat` accessor on a categorical Series that returns the integer code of each value. Codes follow the order of the categories, which Pandas sorts alphabetically by default when converting strings. That is why, in the output above, female is encoded as 0 and male as 1. Missing values receive the code -1.
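
If you want a specific mapping instead of the default one, you can pass an explicit category order. This is a small sketch on invented data; it compares the default codes with the codes produced by a hand-written `pd.CategoricalDtype`.

```python
import pandas as pd

# Hypothetical values, including one missing entry
sex = pd.Series(["male", "female", None, "female"])

# Default: categories are sorted, so female -> 0 and male -> 1
default_codes = sex.astype("category").cat.codes.tolist()

# Explicit order: male -> 0 and female -> 1
male_first = pd.CategoricalDtype(categories=["male", "female"])
explicit_codes = sex.astype(male_first).cat.codes.tolist()

print(default_codes)   # [1, 0, -1, 0]
print(explicit_codes)  # [0, 1, -1, 1]
```

In both cases the missing entry is encoded as -1, so check for that value before feeding the codes to a model.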

## Encoding Examples: One-Hot Encoding
Now, let's see an example of one-hot encoding. This encoding will create a separate column for each category.

```python
# One-hot encoding the 'class' column
# dtype=int yields 0/1 columns; Pandas 2.0+ returns True/False by default
titanic_class_dummies = pd.get_dummies(titanic['class'], prefix='class', dtype=int)
titanic = pd.concat([titanic, titanic_class_dummies], axis=1)
print(titanic.head())
# Output:
#    survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
# 0         0       3    male  22.0      1      0   7.2500        S   Third   
# 1         1       1  female  38.0      1      0  71.2833        C   First   
# 2         1       3  female  26.0      0      0   7.9250        S   Third   
# 3         1       1  female  35.0      1      0  53.1000        S   First   
# 4         0       3    male  35.0      0      0   8.0500        S   Third   
#      who  adult_male deck  embark_town alive  alone  sex_code  class_First  \
# 0    man        True  NaN  Southampton    no  False         1            0
# 1  woman       False    C    Cherbourg   yes  False         0            1
# 2  woman       False  NaN  Southampton   yes   True         0            0
# 3  woman       False    C  Southampton   yes  False         0            1
# 4    man        True  NaN  Southampton    no   True         1            0
#    class_Second  class_Third  
# 0             0            1  
# 1             0            0  
# 2             0            1  
# 3             0            0  
# 4             0            1  
```

The `pd.get_dummies` function performs the one-hot encoding and returns a separate DataFrame with the encoded values. Next, we append this new DataFrame to the original one using the `concat` function. One-hot encoding creates one new column per category of class (`class_First`, `class_Second`, `class_Third`), with binary values indicating which category each record belongs to.
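
To make the "one column per category" idea concrete, here is a minimal sketch on a four-row invented table. It checks that each row has exactly one indicator switched on. The `dtype=int` argument is used so the result contains 0/1 regardless of Pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the 'class' column
tickets = pd.DataFrame({"class": ["Third", "First", "Second", "Third"]})

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(tickets["class"], prefix="class", dtype=int)

columns = list(dummies.columns)
row_sums = dummies.sum(axis=1).tolist()

print(columns)   # ['class_First', 'class_Second', 'class_Third']
print(row_sums)  # [1, 1, 1, 1]
```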

## Lesson Summary
Today, we've learned:

- What categorical data is: Data divided into specific categories.
- Why it's beneficial to convert to categorical types: For memory efficiency and better performance.
- How to perform the conversion: Using the `astype('category')` method in Pandas.
- Encoding examples: Label encoding and one-hot encoding to convert categories into numeric forms.

Now it's time to get hands-on! In the upcoming practice tasks, you'll apply what you've learned. You'll convert columns to categorical types and practice encoding them. This practice will solidify your understanding and build confidence in handling categorical data in Pandas. Let's dive in!

## Converting 'embarked' to Categorical in Titanic Dataset

Hey Space Explorer!

This code converts the embarked column of the Titanic dataset to a categorical type to improve memory efficiency. Then, it prints the data types of the DataFrame. This demonstrates a common preprocessing step in handling datasets.

Just hit the Run button to see it in action!

```py
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Convert 'embarked' column to categorical type
titanic['embarked'] = titanic['embarked'].astype('category')

# Print the data types of the DataFrame
print(titanic.dtypes)


```

Hello Space Explorer!

You've got a great piece of code that demonstrates how to efficiently handle categorical data within a DataFrame using Pandas. This is a valuable preprocessing step, especially for large datasets where memory management is crucial. Converting the 'embarked' column to a categorical type is a practical example of optimizing data storage by reducing memory usage.

Your code looks good to go! When you run it, you'll see the data type of each column in the Titanic dataset, with the 'embarked' column now shown as category. This change reduces the column's memory footprint and can speed up operations such as grouping and comparisons on it.

Feel free to hit the Run button and explore the changes! If you have any questions or need further assistance with your data exploration, I'm here to help. Safe travels through your data universe!

## Fix the Titanic Data Transformation

Great work so far, Space Explorer!

Find the bug in the code provided and fix it so that the embarked column is converted to a categorical type and one-hot encoded correctly.

```py
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Convert 'embarked' column to categorical type
titanic['embarked'] = titanic.astype('category')

# One-hot encode the 'embarked' column
titanic_encoded = pd.get_dummies(titanic, columns=['embarked'])

print(titanic_encoded.head())

```

Hello Space Explorer!

I've spotted the bug in your code. The issue lies in the line where you attempt to convert the 'embarked' column to a categorical type. Instead of converting the specific column, the code mistakenly tries to convert the entire DataFrame to a categorical type. Let's correct that and ensure your code properly converts just the 'embarked' column and then performs the one-hot encoding.
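
To see why that line is a problem, it may help to compare what the two calls return. The sketch below uses a tiny invented table rather than the Titanic data: calling `astype` on the whole DataFrame gives back a DataFrame, which does not fit into a single column, while calling it on one column gives back a Series.

```python
import pandas as pd

# Hypothetical two-column table, invented for this illustration
ports = pd.DataFrame({"embarked": ["S", "C", "Q"], "fare": [7.25, 71.28, 8.46]})

# Buggy pattern: converts every column and returns a whole DataFrame
whole_frame = ports.astype("category")

# Intended pattern: converts just the one column and returns a Series
one_column = ports["embarked"].astype("category")

print(type(whole_frame).__name__, whole_frame.shape)
print(type(one_column).__name__, one_column.shape)
```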

Here's the corrected version of your code:

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Correctly convert 'embarked' column to categorical type
titanic['embarked'] = titanic['embarked'].astype('category')

# One-hot encode the 'embarked' column
titanic_encoded = pd.get_dummies(titanic, columns=['embarked'])

print(titanic_encoded.head())
```

This corrected script will properly convert the 'embarked' column to a categorical data type and then apply one-hot encoding to it, creating separate columns for each category within 'embarked'. This is a common and effective way to prepare categorical data for machine learning models.

Feel free to run this revised code to see the changes and how the one-hot encoding is applied. If you have any more questions or need further assistance, just let me know! Keep exploring the vast universe of data!
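
One detail worth knowing about the `columns=` form of `pd.get_dummies`: it replaces the original column with the indicator columns instead of keeping both. Here is a small sketch on invented data showing the resulting column names.

```python
import pandas as pd

# Hypothetical two-column table, invented for this illustration
ports = pd.DataFrame({"fare": [7.25, 71.28, 8.46], "embarked": ["S", "C", "Q"]})

# With columns=, the listed column is dropped and its indicators are appended
encoded = pd.get_dummies(ports, columns=["embarked"])

encoded_columns = list(encoded.columns)
print(encoded_columns)  # ['fare', 'embarked_C', 'embarked_Q', 'embarked_S']
```

If you need to keep the original column as well, encode the Series on its own and join the result back with `pd.concat`.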

## Handle and Encode Categorical Data

Howdy, Space Wanderer! Let's get our hands dirty with some categorical data handling on the Titanic dataset.

Your mission is to convert the embarked column to a categorical type and create a new column that encodes these categories.

```py
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# TODO: Convert the 'embarked' column to categorical type

# TODO: Create a new column 'embarked_code' with encoded categories

print(titanic[['embarked', 'embarked_code']].head())


```

Howdy, Space Wanderer! Let's tackle this mission together. To convert the 'embarked' column to a categorical type and then encode these categories into a new column, we'll use Pandas' capabilities for handling categorical data and its encoding methods. Here's how you can do it:

1. **Convert the 'embarked' column to a categorical type** using the `astype('category')` method.
2. **Create a new column 'embarked_code'** that encodes the categories using the `.cat.codes` attribute, which provides integer codes for the categories.

Here's your updated code with these tasks completed:

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Convert the 'embarked' column to categorical type
titanic['embarked'] = titanic['embarked'].astype('category')

# Create a new column 'embarked_code' with encoded categories
titanic['embarked_code'] = titanic['embarked'].cat.codes

print(titanic[['embarked', 'embarked_code']].head())
```

This code snippet will convert the 'embarked' column into a categorical type, which can help optimize memory usage and improve performance for certain operations. The new column, 'embarked_code', will contain numerical codes corresponding to the categories in the 'embarked' column, making it easier to use these categories in analytical models or machine learning algorithms.

Run this script to see the first few rows of the dataset with the original 'embarked' column and the new 'embarked_code' column to confirm that the encoding has been applied correctly. If you have any further questions or need more assistance, just holler!
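
If you later need to translate the codes back into labels, the categories themselves give you the lookup table. The sketch below uses invented values with one missing entry; note that missing values are encoded as -1 and therefore have no matching label.

```python
import pandas as pd

# Hypothetical stand-in for the 'embarked' column, with one missing value
embarked = pd.Series(["S", "C", None, "Q"]).astype("category")

codes = embarked.cat.codes.tolist()

# Position in .cat.categories is the code assigned to that label
code_to_label = dict(enumerate(embarked.cat.categories))

print(codes)          # [2, 0, -1, 1]
print(code_to_label)  # {0: 'C', 1: 'Q', 2: 'S'}
```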

## One-Hot Encode the Embarked Column

Hey, Space Voyager! Let's level up your skills. Add the missing pieces to the code. You are going to load the Titanic dataset and apply one-hot encoding.

May the cosmos guide you!

```py
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# TODO: one-hot encode the 'embarked' column and add the new columns to the titanic DataFrame

print(titanic.head())


```


Hello, Space Voyager! Let's complete your mission by applying one-hot encoding to the 'embarked' column of the Titanic dataset. One-hot encoding turns each category into its own binary indicator column, so that machine learning algorithms that expect numeric input can use the data. For this task, we'll use the `pd.get_dummies()` function from Pandas, which performs exactly this transformation.

Here's how you can update your code to include one-hot encoding:

```python
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# One-hot encode the 'embarked' column and add the new columns to the titanic DataFrame
embarked_dummies = pd.get_dummies(titanic['embarked'], prefix='embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

print(titanic.head())
```

### Explanation:
1. **One-hot encoding**: The `pd.get_dummies()` function is used to convert the 'embarked' column into multiple columns representing the presence of each unique value with a binary output. The `prefix='embarked'` argument adds a label to the beginning of each new column name, indicating the original column from which they were derived.
2. **Concatenation**: The `pd.concat()` function merges the original DataFrame, `titanic`, with the new DataFrame containing the one-hot encoded columns, `embarked_dummies`. The `axis=1` argument specifies that the concatenation should be column-wise (i.e., side by side).

This code will append three new columns to the `titanic` DataFrame: `embarked_C`, `embarked_Q`, and `embarked_S`, one per embarkation point (C = Cherbourg, Q = Queenstown, S = Southampton). Each value indicates whether that row's 'embarked' value was that port: `True`/`False` in Pandas 2.0 and later, or 1/0 in earlier versions or when you pass `dtype=int`.

Run the script to see the DataFrame with the newly added one-hot encoded columns. If you have any questions or need further guidance, the cosmos and I are here to help!
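
Since the embarked column has a couple of missing values, you may wonder what happens to those rows. By default they get no indicator at all; the optional `dummy_na=True` argument adds a separate indicator for them. A minimal sketch on invented data, using `dtype=int` so the result is 0/1:

```python
import pandas as pd

# Hypothetical stand-in for the 'embarked' column, with one missing value
embarked = pd.Series(["S", "C", None, "Q"])

# Default: the row with the missing value is all zeros
plain = pd.get_dummies(embarked, prefix="embarked", dtype=int)

# dummy_na=True: an extra column flags the missing value
with_na = pd.get_dummies(embarked, prefix="embarked", dummy_na=True, dtype=int)

plain_row_sums = plain.sum(axis=1).tolist()
with_na_row_sums = with_na.sum(axis=1).tolist()

print(plain_row_sums)    # [1, 1, 0, 1]
print(with_na_row_sums)  # [1, 1, 1, 1]
```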