**This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/categorical-variables).**

---


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/test.csv


In [2]:
X_train = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv', index_col='Id')
X_test = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv', index_col='Id')

In [3]:
# Drop rows with a missing target, then separate the target from the predictors
X_train.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_train.SalePrice
X_train.drop(['SalePrice'], axis=1, inplace=True)

In [4]:
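# Split columns by dtype (the column lists are derived from the test set)
nobject_cols = [col for col in X_test.columns if X_test[col].dtype != "object"]
object_cols = [col for col in X_test.columns if X_test[col].dtype == "object"]
# Columns of each kind that contain missing values in the test set
obj_columns = [col for col in object_cols if X_test[col].isnull().any()]
nobj_columns = [col for col in nobject_cols if X_test[col].isnull().any()]
# Categorical columns with fewer than 10 unique values (cheap to one-hot encode)
low_cardinality_cols = [col for col in object_cols if X_test[col].nunique() < 10]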

In [12]:
from sklearn.preprocessing import OneHotEncoder
X_obj = X_train.copy()
X_test_obj = X_test.copy()
# Note: scikit-learn 1.2+ renames `sparse` to `sparse_output` (`sparse` was removed in 1.4)
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(oh_encoder.fit_transform(X_obj[low_cardinality_cols])) 
OH_cols_test = pd.DataFrame(oh_encoder.transform(X_test_obj[low_cardinality_cols])) 
OH_cols_train.index = X_obj.index
OH_cols_test.index = X_test_obj.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_obj.drop(object_cols, axis=1)
num_X_test = X_test_obj.drop(object_cols, axis=1)

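# Add one-hot encoded columns to numerical features
X_obj = pd.concat([num_X_train, OH_cols_train], axis=1)
X_test_obj = pd.concat([num_X_test, OH_cols_test], axis=1)

# One-hot columns are named 0, 1, 2, ...; make every column name a string,
# since newer scikit-learn versions reject mixed-type column names
X_obj.columns = X_obj.columns.astype(str)
X_test_obj.columns = X_test_obj.columns.astype(str)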


Int64Index([1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470,
            ...
            2910, 2911, 2912, 2913, 2914, 2915, 2916, 2917, 2918, 2919],
           dtype='int64', name='Id', length=1459)

In [16]:
from sklearn.impute import SimpleImputer

X_nobj_plus = X_obj.copy()
X_test_nobj_plus = X_test_obj.copy()

for col in nobj_columns:
    X_nobj_plus[col + '_was_missing'] = X_nobj_plus[col].isnull()
    X_test_nobj_plus[col + '_was_missing'] = X_test_nobj_plus[col].isnull()

impute = SimpleImputer()
final_X_nobj = pd.DataFrame(impute.fit_transform(X_nobj_plus))
final_X_test_nobj = pd.DataFrame(impute.transform(X_test_nobj_plus))

# Imputation removed the column names and the index; put them back
final_X_nobj.columns = X_nobj_plus.columns
final_X_test_nobj.columns = X_test_nobj_plus.columns
final_X_nobj.index = X_obj.index
final_X_test_nobj.index = X_test_obj.index
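
The cell above adds the `_was_missing` columns by hand before imputing. As a rough alternative sketch (not what this notebook does), `SimpleImputer` also accepts `add_indicator=True`, which appends one indicator column for each feature that had missing values during `fit`. The tiny frame below is made up for illustration; only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; only the column names come from the housing data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450.0, 9600.0, 11250.0],
})

# Mean imputation plus an indicator column for each feature
# that contained missing values when the imputer was fitted.
toy_imputer = SimpleImputer(strategy="mean", add_indicator=True)
filled = toy_imputer.fit_transform(toy)

# Two imputed columns followed by one indicator column (for LotFrontage).
print(filled.shape)
print(filled[1, 0])   # the missing LotFrontage replaced by the column mean
print(filled[:, 2])   # 1.0 where LotFrontage was missing, else 0.0
```

The result is a plain NumPy array, so column names and the index still have to be restored afterwards, as in the cell above.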

In [17]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_nobj, y)
pred = model.predict(final_X_test_nobj)
output = pd.DataFrame({'Id': final_X_test_nobj.index,
                       'SalePrice': pred})
output.to_csv('submission.csv', index=False)

In [18]:
output

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 127014.58 |
| 1 | 1462 | 155681.00 |
| 2 | 1463 | 178072.01 |
| 3 | 1464 | 181484.90 |
| 4 | 1465 | 199703.96 |
| ... | ... | ... |
| 1454 | 2915 | 84244.11 |
| 1455 | 2916 | 86480.50 |
| 1456 | 2917 | 151158.99 |
| 1457 | 2918 | 112947.00 |


# Keep going

With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called **pipelines**. 

**[Learn to use pipelines](https://www.kaggle.com/alexisbcook/pipelines)** to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.
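
To give a feel for what that looks like, here is a minimal sketch of the same imputation, one-hot encoding and random forest steps bundled with scikit-learn's `Pipeline` and `ColumnTransformer`. It is an illustration under assumptions, not the course's solution: the tiny `toy_train` and `toy_test` frames and all of their values are made up, and only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the housing data (values are illustrative only).
toy_train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0, 14260.0, 14115.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave", np.nan, "Grvl"],
})
toy_y = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])
toy_test = pd.DataFrame({
    "LotArea": [11622.0, np.nan],
    "Street": ["Pave", "Dirt"],  # "Dirt" never appears in the training data
})

numeric_cols = ["LotArea"]
categorical_cols = ["Street"]

# Numerical columns: fill missing values with the column mean.
numeric_transformer = SimpleImputer(strategy="mean")

# Categorical columns: fill missing values, then one-hot encode.
# handle_unknown="ignore" encodes unseen categories as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing step.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Preprocessing and model live in one object that is fitted in one call.
toy_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

toy_pipeline.fit(toy_train, toy_y)
toy_preds = toy_pipeline.predict(toy_test)
print(toy_preds.shape)
```

Because the imputers and the encoder are fitted inside the pipeline, the test data is transformed with exactly the statistics and categories learned from the training data, with no manual bookkeeping of indices or column names.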

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161289) to chat with other Learners.*