**Introduction**

- Develop a baseline model for comparing performance against models with more features
- Encode categorical features so that the model can make better use of the information
- Generate new features to provide more information for the model
- Select features to reduce overfitting and increase prediction speed

In [None]:
import pandas as pd
ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(10)

**Preparing target column**

First I'll look at project states and convert the column into something we can use as targets in a model.

In [None]:
pd.unique(ks.state)

We have six states. How many records are there of each?

In [None]:
ks.groupby('state')['ID'].count()

In [None]:
# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column: "successful" == 1, all others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

**Converting timestamps**

In [None]:
ks = ks.assign(hour = ks.launched.dt.hour,
              day = ks.launched.dt.day,
              month = ks.launched.dt.month,
              year = ks.launched.dt.year)
ks.head()

**Prepping categorical variables**

Now for the categorical variables -- `category`, `currency`, and `country` -- I'll need to convert them into integers so our model can use the data. For this I'll use scikit-learn's `LabelEncoder`. This assigns an integer to each value of the categorical feature and replaces those values with the integers.

In [None]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column (the encoder is refit per column)
encoded = ks[cat_features].apply(encoder.fit_transform)
encoded.head()
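
To make the encoding concrete, here is a minimal sketch on a made-up column. The names `toy`, `toy_encoder`, and `toy_codes` are mine and are not part of the Kickstarter data. `LabelEncoder` sorts the distinct values and numbers them from 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only, not taken from the Kickstarter data
toy = pd.Series(['USD', 'GBP', 'USD', 'EUR'])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy)

# classes_ holds the sorted distinct values; a value's code is its position there
print(list(toy_encoder.classes_))  # ['EUR', 'GBP', 'USD']
print(list(toy_codes))             # [2, 1, 2, 0]
```

Note that the integers carry no meaning beyond identity: 'USD' is not "larger" than 'EUR'. Tree-based models can usually work with this, but it is worth keeping in mind.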

I'll collect all the features we'll use in a new dataframe and use that to train a model.

In [None]:
# Since ks and encoded have the same index, I can easily join them
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
data.head()

**Creating training, validation, and test splits**

In [None]:
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

train = data[:-2 * valid_size]
valid = data[-2 * valid_size: -valid_size]
test = data[-valid_size:]

In general you want to be careful that each data set has the same proportion of target classes. I'll print out the fraction of successful outcomes for each of our datasets.

In [None]:
for each in [train, valid, test]:
    print(f"Outcome fraction = {each.outcome.mean():.4f}")

This looks good: each set has around 35% true outcomes, likely because the data was well randomized beforehand. A good way to get matching class proportions automatically is `sklearn.model_selection.StratifiedShuffleSplit`, but I don't need to use it here.
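
For reference, a stratified split could look roughly like the sketch below. It runs on synthetic data (the `toy` frame and its 35% positive rate are assumptions chosen to resemble our data), and unlike the slicing above it shuffles the rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for `data`: 1000 rows, 35% positive outcomes
toy = pd.DataFrame({'goal': np.arange(1000),
                    'outcome': np.array([1] * 350 + [0] * 650)})

# One shuffled split that holds out 10% of rows and preserves class proportions
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(toy, toy['outcome']))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

for name, part in [('train', toy_train), ('test', toy_test)]:
    print(f"{name}: outcome fraction = {part['outcome'].mean():.4f}")
```

To get a validation set as well, the same splitter could be applied a second time to `toy_train`.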

**Training a LightGBM model**

For this course we'll be using a LightGBM model. This is a tree-based model that typically performs very well, often matching or beating XGBoost, and it's also relatively fast to train. We won't do hyperparameter optimization because that isn't the goal of this course, so our models won't give the absolute best performance you can get. But you'll still see model performance improve as we do feature engineering.

In [None]:
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label= train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label= valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
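# Note: lightgbm 4.0+ no longer accepts these two keyword arguments;
# there, pass callbacks=[lgb.early_stopping(10)] instead
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                early_stopping_rounds=10, verbose_eval=False)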

**Making predictions & evaluating the model**

Finally, let's make predictions on the test set with the model and see how well it performs. An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.

In [None]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")
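
As a reminder of what the AUC score measures, here is a small sketch with hand-made labels and scores (the values are illustrative, not model output). The AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn import metrics

# Illustrative labels and predicted scores, not produced by the model above
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the 4 positive/negative pairs, 3 are ranked correctly
# (0.35 > 0.1, 0.8 > 0.1, 0.8 > 0.4) and 1 is not (0.35 < 0.4)
toy_auc = metrics.roc_auc_score(y_true, y_score)
print(toy_auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is insensitive to the exact probability values and depends only on their order.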