# Preprocessing Data and Pipelines

## Dealing with Categorical Features
- scikit-learn estimators won't accept string-valued categorical features by default
- We need to encode them as numbers
- Dummy variables:
    - Imagine that a dataframe cars has a column Origin (Asia, US, Europe)
    - Create three columns origin_Asia, origin_US, origin_Europe
    - If the origin is Asia, origin_Asia is 1 and the other two columns are 0.

In [None]:
## Dummy variables with pandas get_dummies() ##
import pandas as pd

df = pd.read_csv("somecsvfile.csv")
df = pd.get_dummies(df, drop_first=True)
# drop_first=True drops one dummy column per feature, which removes redundant information.
# If the column origin_Asia is dropped:
    # origin_US and origin_Europe both being zero already indicates that the car is Asian
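
To make the effect of `drop_first` concrete, here is a minimal sketch on a made-up dataframe. The `cars` data, its `mpg` values, and the lowercase `origin` column name are assumptions for illustration only; they are not taken from the CSV file above.

```python
import pandas as pd

# Hypothetical toy data standing in for the cars dataframe
cars = pd.DataFrame({
    "mpg": [18.0, 33.5, 26.0, 24.0],
    "origin": ["US", "Asia", "Europe", "Asia"],
})

# Full encoding: one indicator column per category
full = pd.get_dummies(cars, dtype=int)

# drop_first=True drops the first category in sorted order (here "Asia")
reduced = pd.get_dummies(cars, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the Asian cars are the rows where both remaining indicator columns are 0.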

## Imputing Missing Data
- Making an educated guess about the missing values

In [None]:
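import numpy as np
from sklearn.impute import SimpleImputer  # replaces the Imputer class removed in scikit-learn 0.22

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
# missing_values: which placeholder marks a missing entry (here NaN)
# strategy: how to fill it ("mean", "median", "most_frequent", or "constant")
# SimpleImputer always works column-wise, so the old axis argument no longer exists
imp.fit(X)
X = imp.transform(X)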
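
As a small worked example of mean imputation, the sketch below fills the gaps in a made-up array. It assumes a current scikit-learn release, where the imputer class is `SimpleImputer` in `sklearn.impute`. The names `X_demo`, `imp_demo`, and `X_filled` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one missing value per column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp_demo.fit_transform(X_demo)

# Each NaN is replaced by the mean of the observed values in its column
print(imp_demo.statistics_)
print(X_filled)
```

The learned column means are stored in `statistics_`, so the same fill values can be reused on new data with `transform`.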

### Imputing within a Pipeline

In [None]:
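import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed Imputer class
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
logreg = LogisticRegression()

# The pipeline fits the imputer on the training data and applies the same
# transformation before the classifier in both fit and predict
pipeline = Pipeline([("imputation", imp), ("logistic_regression", logreg)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)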
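
The following is a self-contained sketch of the same idea on synthetic data, since `X_train` and `y_train` are not defined in these notes. The dataset, the 10% missing rate, and all variable names are assumptions for illustration, and it assumes a scikit-learn version that provides `SimpleImputer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the entries knocked out at random
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

pipe = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("logistic_regression", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The fill values come from the training split only
means = pipe.named_steps["imputation"].statistics_
acc = pipe.score(X_te, y_te)
print(means)
print(acc)
```

Because the imputer lives inside the pipeline, the test split is filled with the training means, which avoids leaking test information into preprocessing.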

## Centering and Scaling (Normalization)

### Why scale your data?
- Many models use some form of distance
- Features on larger scales can unduly influence your model
- We want features to be on a similar scale

### Ways to normalize your data
- **Standardization:** Subtract the mean and divide by the standard deviation
    - So that all features are centered around 0 and have variance 1.
- **Min-max scaling:** Subtract the minimum and divide by the range
    - The result lies in [0, 1]; it can also be rescaled to [-1, 1]
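
The two approaches can be written out by hand and compared with scikit-learn's scalers. This is a minimal sketch on a made-up array; `X_demo` and the other variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Standardization: subtract the column mean, divide by the column standard deviation
standardized = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: subtract the column minimum, divide by the column range
col_range = X_demo.max(axis=0) - X_demo.min(axis=0)
min_max = (X_demo - X_demo.min(axis=0)) / col_range

# Stretch [0, 1] to [-1, 1]
minus_one_to_one = 2 * min_max - 1

# The manual results agree with the library scalers
same_std = np.allclose(standardized, StandardScaler().fit_transform(X_demo))
same_mm = np.allclose(min_max, MinMaxScaler().fit_transform(X_demo))
print(same_std, same_mm)
```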

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale, StandardScaler

X_scaled = scale(X)

# Scaling in a pipeline
pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

## Using CV in a Pipeline

In [None]:
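import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

steps = [("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]

pipeline = Pipeline(steps)
# Parameter names are "<step name>__<parameter>": notice the two underscores after the step name
parameters = {"knn__n_neighbors": np.arange(1, 50)}

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)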
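
For a version that runs end to end, the sketch below applies the same grid search to scikit-learn's bundled iris dataset. The choice of dataset, the split, the smaller grid (1 to 19 neighbors), and `cv=5` are assumptions for illustration, not part of the notes above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# "knn" is the step name, "n_neighbors" the parameter of that step
grid = {"knn__n_neighbors": np.arange(1, 20)}

search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

Within each cross-validation fold the scaler is refit on that fold's training portion only, so the held-out portion never influences the scaling.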