**There's a lot of non-numeric data out there. Here's how to use it for machine learning.**

In this tutorial, you will learn what a categorical variable is, along with three approaches for handling this type of data.

[Source](https://www.kaggle.com/alexisbcook/categorical-variables)

- Three Approaches

1. **Drop Categorical Variables**

The easiest approach to dealing with categorical variables is to simply remove them from the dataset.

*This approach will only work well if the columns do not contain useful information.*

2. **Label Encoding**

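**Label encoding** assigns each unique value to a different integer.

This approach assumes an ordering of the categories. Not all categorical variables have a clear ordering in their values, but we refer to those that do as **ordinal variables**.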
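
As a small illustration, here is a sketch of an ordinal mapping done by hand with pandas. The column name `Breakfast`, its values, and the ranking in `order` are all made up for this example and are assumptions, not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical survey column; the name and values are made up for illustration
df = pd.DataFrame({"Breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"]})

# Assumed ranking of the categories, from lowest to highest
order = ["Never", "Rarely", "Most days", "Every day"]
mapping = {category: rank for rank, category in enumerate(order)}

# Replace each category with its integer rank
df["Breakfast_encoded"] = df["Breakfast"].map(mapping)
print(df)
```

Note that scikit-learn's `LabelEncoder` assigns integers in sorted order of the labels, so it will not necessarily follow a ranking like this one.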

3. **One-Hot Encoding**

**One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data.
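
To make the idea concrete, here is a minimal sketch on a toy column. It uses `pandas.get_dummies` as a quick stand-in rather than scikit-learn's `OneHotEncoder`, and the `Color` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical column; the name and values are made up for illustration
df = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Yellow"]})

# One new column per distinct value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df["Color"], dtype=int)
print(one_hot)
```

Each row has a 1 in exactly one of the new columns, marking which value it held in the original column.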

In [None]:
# Step 1:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables: ")
print(object_cols)

**Define Function to Measure Quality of Each Approach**

In [1]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators= 100, random_state= 0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
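
To show how this helper might be called, here is a sketch on synthetic data. The dataset comes from `make_regression` and is purely illustrative (an assumption, not the tutorial's housing data); the helper is repeated so the snippet runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same helper as above, repeated so this snippet is self-contained
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Synthetic regression data, purely illustrative
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0)

mae = score_dataset(X_train, X_valid, y_train, y_valid)
print("MAE on synthetic data:", mae)
```

The function returns the mean absolute error on the validation set, so a lower score means a better approach.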

**1. Score from Approach 1 (Drop Categorical Variables)**

In [None]:
drop_X_train = X_train.select_dtypes(exclude= ['object'])
drop_X_valid = X_valid.select_dtypes(exclude= ['object'])

print("MAE from Approach 1 (Drop categorical variables): ")
print(score_dataset(drop_X_train, drop_X_valid,y_train, y_valid))

**2. Score from Approach 2 (Label Encoding)**

Scikit-learn has a `LabelEncoder` class that can be used to get label encodings. We loop over the categorical variables and apply the label encoder separately to each column.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()

for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
    
print("MAE from Approach 2 (Label Encoding): ")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
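
One caveat with the loop above: `LabelEncoder.transform` raises an error when it meets a value it did not see during fitting. A possible way to check for this beforehand is sketched below. The toy `X_train` and `X_valid` frames and their column names are made up for illustration, and `good_label_cols` and `bad_label_cols` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data standing in for X_train / X_valid
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                        "Shape": ["Reg", "IR1", "Reg"]})
X_valid = pd.DataFrame({"Street": ["Pave", "Grvl"],
                        "Shape": ["Reg", "IR2"]})
object_cols = ["Street", "Shape"]

# Columns whose validation categories all appear in the training data
good_label_cols = [col for col in object_cols
                   if set(X_valid[col]).issubset(set(X_train[col]))]

# Columns with unseen categories; these need to be dropped or handled separately
bad_label_cols = sorted(set(object_cols) - set(good_label_cols))

print("Safe to label encode:", good_label_cols)
print("Contain unseen categories:", bad_label_cols)
```

Here `Shape` is flagged because `IR2` appears only in the validation data.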



**3. Score from Approach 3 (One-Hot Encoding)**

We use the `OneHotEncoder` class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.

 - We set `handle_unknown='ignore'` to avoid errors when the validation data contains classes that aren't represented in the training data, and
 
 - setting `sparse=False` ensures that the encoded columns are returned as a NumPy array (instead of a sparse matrix). In scikit-learn 1.2 and later, this parameter is named `sparse_output`.
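
As a quick sketch of what `handle_unknown='ignore'` does, the toy example below fits an encoder on two made-up categories and then transforms a value it has never seen. The data is illustrative only. To avoid depending on the `sparse` parameter, it leaves the output sparse and converts it with `toarray()`.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categories, made up for illustration
train = [["Red"], ["Green"]]
valid = [["Green"], ["Blue"]]  # "Blue" never appears in the training data

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; toarray() converts it to a dense NumPy array
encoded = encoder.transform(valid).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen category is encoded as a row of all zeros instead of raising an error.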

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Newer scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

**Conclusion**

The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!

In [None]:
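# Preprocess test data (assumes `final_imputer` was fitted earlier; it is not defined in this section)
final_X_test = pd.DataFrame(final_imputer.transform(X_test))

# Get test predictions (assumes `model` was trained earlier; it is not defined in this section)
preds_test = model.predict(final_X_test)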

In [None]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)