# Dummies for dummies

Here's an example of how to use the pandas `get_dummies` function with some dummy data:

Suppose you have a dataset with a column called "fruits" that contains categorical data on different types of fruits.

```json
{
  "fruits": [
    "apple", 
    "banana", 
    "apple", 
    "orange", 
    "banana", 
    "pear"
  ]
}
```

Here's how you can use `get_dummies` to create a one-hot encoded representation of the data:

In [1]:
import pandas as pd

# The key is the column name; the list holds the data in that column
data = {'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'pear']}

# Create pandas dataframe
df = pd.DataFrame(data)
df

|   | fruits |
|---|--------|
| 0 | apple  |
| 1 | banana |
| 2 | apple  |
| 3 | orange |
| 4 | banana |
| 5 | pear   |


In [2]:
# Use get_dummies to one-hot encode categorical column
one_hot_df = pd.get_dummies(df['fruits'])

# Print one-hot encoded dataframe
one_hot_df

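|   | apple | banana | orange | pear |
|---|-------|--------|--------|------|
| 0 | 1     | 0      | 0      | 0    |
| 1 | 0     | 1      | 0      | 0    |
| 2 | 1     | 0      | 0      | 0    |
| 3 | 0     | 0      | 1      | 0    |
| 4 | 0     | 1      | 0      | 0    |
| 5 | 0     | 0      | 0      | 1    |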


In [3]:
data_train = pd.concat([df, one_hot_df], axis=1)
data_train

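|   | fruits | apple | banana | orange | pear |
|---|--------|-------|--------|--------|------|
| 0 | apple  | 1     | 0      | 0      | 0    |
| 1 | banana | 0     | 1      | 0      | 0    |
| 2 | apple  | 1     | 0      | 0      | 0    |
| 3 | orange | 0     | 0      | 1      | 0    |
| 4 | banana | 0     | 1      | 0      | 0    |
| 5 | pear   | 0     | 0      | 0      | 1    |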


As you can see, `get_dummies` has created a new column for each unique value in the "fruits" column and assigned a binary value of 1 or 0 to each row depending on whether it contains that value. (In pandas 2.0 and later the dummy columns are boolean by default, so the output shows `True`/`False` unless you pass `dtype=int`.) This one-hot encoded representation is useful for machine learning tasks where categorical data needs to be converted to a numerical format.
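As a small variation on the cells above, `get_dummies` can also be applied to the whole DataFrame through its `columns` argument, which swaps the original column for prefixed dummy columns in one step. This is only a sketch of one option; it passes `dtype=int` so the result holds 1/0 regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'pear']})

# Encode the 'fruits' column; the original column is replaced by the dummies.
# dtype=int forces 1/0 output (pandas 2.0+ would otherwise return True/False).
encoded = pd.get_dummies(df, columns=['fruits'], dtype=int)

print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```

The new columns are named after the source column (`fruits_apple`, `fruits_banana`, and so on), which helps when several categorical columns are encoded at once.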

In [7]:
# The same steps combined in one cell; this is what Kay Jan Wong was talking about
data1 = {'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'pear']}
data_train = pd.DataFrame(data1)

data_ohe = pd.get_dummies(data_train["fruits"])

data_train = pd.concat([data_train, data_ohe], axis=1)
data_train

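|   | fruits | apple | banana | orange | pear |
|---|--------|-------|--------|--------|------|
| 0 | apple  | 1     | 0      | 0      | 0    |
| 1 | banana | 0     | 1      | 0      | 0    |
| 2 | apple  | 1     | 0      | 0      | 0    |
| 3 | orange | 0     | 0      | 1      | 0    |
| 4 | banana | 0     | 1      | 0      | 0    |
| 5 | pear   | 0     | 0      | 0      | 1    |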


# Categorical data

## One-hot encoded representation

One-hot encoding creates a new column for each unique category in a categorical variable and assigns a binary value of 1 or 0 to each row depending on whether it contains that category. This representation is often used as input for machine learning models or for further data analysis.

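## You need to keep track of what kind of encoding you have used

When using one-hot encoding with machine learning algorithms, the model itself is usually not told that the data has been one-hot encoded: it treats each dummy column as an ordinary numeric feature. What matters is that you keep track of the encoding you used, so that new data is encoded into the same columns, in the same order, as the training data.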

Most machine learning libraries and frameworks have built-in functions that can handle one-hot encoded data.

For example, scikit-learn provides the `OneHotEncoder` transformer, which one-hot encodes categorical data; its output can be fed directly into machine learning models.
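A minimal sketch of that workflow, assuming scikit-learn is installed: the encoder is fitted on the training data and then reused on new data, so both end up with the same columns. The `new` DataFrame and its unseen `'kiwi'` value are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'pear']})
new = pd.DataFrame({'fruits': ['pear', 'kiwi']})  # 'kiwi' never appears in training

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a sparse matrix by default; toarray() makes it dense
train_encoded = encoder.fit_transform(train[['fruits']]).toarray()

# Reuse the fitted encoder so new data gets the same columns in the same order
new_encoded = encoder.transform(new[['fruits']]).toarray()

print(encoder.categories_[0].tolist())
print(new_encoded.tolist())
```

Unlike `get_dummies`, the fitted encoder remembers which categories it saw, which is the main reason to prefer it inside a training pipeline.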

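However, it's also important to keep in mind that one-hot encoding **can increase the dimensionality of the data**: every distinct category becomes its own column, so a column with many distinct values produces a very wide table. That can lead to overfitting and to higher memory and computation costs.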
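To make the dimensionality point concrete, here is a rough sketch. The `user_id` column and its 1000 values are hypothetical, chosen only to show how the column count follows the number of distinct categories. `drop_first=True` is one built-in option that removes a single redundant column per encoded variable, though it does not change the overall growth.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1000 distinct IDs
df = pd.DataFrame({'user_id': [f'user_{i}' for i in range(1000)]})

# One dummy column is created per distinct value
wide = pd.get_dummies(df['user_id'])
print(wide.shape)

# drop_first=True drops the first category's column, which the others imply
fruits = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'pear'])
reduced = pd.get_dummies(fruits, drop_first=True)
print(reduced.columns.tolist())
```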

In some cases, it may be better to use other encoding methods like **label encoding** or **ordinal encoding** depending on the specific requirements of the problem.
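As a rough illustration of that alternative, the sketch below maps each fruit to a single integer using a pandas categorical instead of creating one column per category. The `fruits_code` column name is an assumption made for the example. Integer codes imply an order between categories, so this tends to suit tree-based models or categories that really are ordered.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'pear']})

# Convert to a categorical and take the integer code of each category.
# Categories are sorted alphabetically: apple=0, banana=1, orange=2, pear=3.
df['fruits_code'] = df['fruits'].astype('category').cat.codes

print(df['fruits_code'].tolist())
```

The result stays a single column no matter how many distinct fruits there are, which is the trade-off against the wider one-hot table.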