In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data_path = "../data/notebooks/4_merged_data.csv"

In [3]:
df = pd.read_csv(data_path)



In [4]:
df.shape

(165006, 38)

- **We do not fill missing values with a constant because in the test environment we expect to predict on single instances rather than in batches, and in that setting we would not receive null values.**
- **Days to Delivery is not used as a feature because a significant number of its values are null or corrupt (scraping issues). Further research also showed that the creator of a campaign can change the date over time and is not obliged to deliver by the stated date (which seems to happen quite often).**


In [5]:
cols = ['launched_at',  'status', 'days_to_deadline', 'goal',
       'sub_category', 'category', 'blurb_length', 'location_country',  'rewards_mean', 'rewards_median',
       'rewards_variance', 'rewards_SD', 'rewards_MIN', 'rewards_MAX' , 'rewards_NUM', 'currency', 'creator_created', 'creator_backed', 'launch_year', 'launch_month', 'deadline_year', 'deadline_month']
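df = df[cols].copy()
# Drop rows that are missing reward statistics or the blurb length
df = df.dropna(axis=0, subset=["rewards_MIN", "blurb_length"])
# A missing creator history is treated as zero projects created/backed
df['creator_created'] = df['creator_created'].fillna(0)
df['creator_backed'] = df['creator_backed'].fillna(0)
df = df.sort_values("launched_at")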

df = df.reset_index(drop=True)


In [6]:

df["launched_at"]  = pd.to_datetime(df["launched_at"]).dt.date
df.sort_values("launched_at" , inplace=True)


In [7]:
df.drop(['launched_at'] ,axis=1 , inplace=True)
df.reset_index(inplace=True)
df.drop('index', inplace=True , axis=1)

**The Creator Backed feature appears to be stored as strings, some of which contain commas. Let's fix that.**

In [8]:
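def get_backed(x):
    """Convert a backed-count value to float, stripping commas if needed."""
    try:
        return float(x)
    except ValueError:
        # Strings such as "1,234" fail float(); remove the commas and retry
        return float(x.replace(",", ""))

df['creator_backed'] = df["creator_backed"].apply(get_backed)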

In [9]:
from sklearn.preprocessing import LabelBinarizer

binarizer= LabelBinarizer()
df["status"] = binarizer.fit_transform(df["status"])



In [10]:
from sklearn.preprocessing import OneHotEncoder
# Note: on scikit-learn >= 1.2 this argument is named sparse_output instead of sparse
encoder = OneHotEncoder(sparse=False)
cat_cols=['category', 'sub_category', 'currency', 'location_country']
X_hot = encoder.fit_transform(df[cat_cols])


onehotcols = []
for cat in encoder.categories_:
    for col in cat:
        onehotcols.append(col)

X_hot = pd.DataFrame(X_hot , columns=onehotcols)
df =pd.concat([df , X_hot] , axis=1)

In [11]:
df.drop(cat_cols , axis=1 , inplace=True)

In [12]:
train_years =[2014 ,2015, 2016, 2017]

valid_years = [2018]

df_train = df[df['launch_year'].isin(train_years)]
df_valid = df[df['launch_year'].isin(valid_years)]

In [13]:
X_train , y_train = df_train.drop("status" , axis=1) , df_train['status']
X_valid , y_valid = df_valid.drop("status" , axis=1) , df_valid['status']

In [14]:
import operator

from sklearn.ensemble import RandomForestClassifier


def score(X_train, X_test, y_train, y_test):
    """Fit a random forest; return (model, accuracy on the test split, feature importances)."""
    rf = RandomForestClassifier(n_estimators=100, random_state=13579)
    rf.fit(X_train, y_train)
    rf_score = rf.score(X_test, y_test)

    # Map each feature name to its importance, then sort from most to least important
    rf_fet = dict(zip(X_train.columns.values, rf.feature_importances_))
    rf_fet = sorted(rf_fet.items(), key=operator.itemgetter(1), reverse=True)

    return (rf, rf_score, rf_fet)





In [15]:



_ , scores , _ = score(X_train , X_valid , y_train , y_valid)
print(scores)

0.8138555870778805


**We have reached a validation accuracy of about 81%, which is acceptable**

**But there are problems with a few features**
- Creators backed and creators created hold the values from the time the data was scraped, not from the time the creator launched the project. This can lead to data leakage, so we need to remove these columns
- Random Forest, the model we are using, is good at interpolation but not as good at extrapolation, so any feature with a temporal component can hurt the model (so let us try removing features like launch_month, deadline_month, etc.)
- We also have to discard launch year and deadline year because
    - Ideally we would retrain the model every month and test it on the next one, so the model would always be up to date and never tested on a previous year's data; in that case knowing the launch and deadline years is of no use
    - Temporal variables are best avoided with models like Random Forest, as they are not good at extrapolation
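
The monthly retrain-and-test idea from the list above can be sketched as a walk-forward split. This is only an illustration under stated assumptions: the helper names `month_range` and `walk_forward_splits` are hypothetical, the rows are toy data, and the real notebook works on a DataFrame rather than a list of dicts.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def walk_forward_splits(rows, first_test, last_test):
    """Yield (test_period, train_rows, test_rows) for each test month.

    rows: iterable of dicts with 'launch_year' and 'launch_month' keys.
    The training set is everything launched strictly before the test month.
    """
    rows = list(rows)
    for period in month_range(first_test, last_test):
        train = [r for r in rows if (r["launch_year"], r["launch_month"]) < period]
        test = [r for r in rows if (r["launch_year"], r["launch_month"]) == period]
        if train and test:  # skip months with nothing to train or test on
            yield period, train, test


# Toy data (hypothetical), one project per month
rows = [
    {"launch_year": 2017, "launch_month": 11, "status": 1},
    {"launch_year": 2017, "launch_month": 12, "status": 0},
    {"launch_year": 2018, "launch_month": 1, "status": 1},
    {"launch_year": 2018, "launch_month": 2, "status": 0},
]

splits = list(walk_forward_splits(rows, (2017, 12), (2018, 2)))
for period, train, test in splits:
    print(period, "train:", len(train), "test:", len(test))
```

Because each test month is only ever scored by a model trained on earlier months, the launch and deadline years carry no useful signal in this setup.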

In [16]:
drop_cols = ['creator_created', 'creator_backed']

In [17]:
X_train.drop(drop_cols , axis=1  ,inplace=True)
X_valid.drop(drop_cols , axis=1  ,inplace=True)

In [18]:



_ , scores , _ = score(X_train , X_valid , y_train , y_valid)
print(scores)

0.7942253134437631


**After removing the columns that I suspect of causing data leakage, we have not yet hit our sweet spot. Let us do some EDA and see if we find something interesting.**

- **Before deciding on complicated feature engineering techniques, keep in mind that the objective of our project is to make the model as interpretable and as simulatable (so to speak) as possible, so let us revisit the product we decided upon.**

- **Some variables are fixed by the user, while the others are subject to change and would be exploited. For the sake of explainability, let us therefore minimize advanced feature engineering and confine it to the fixed variables.**

- **So we cannot apply advanced feature engineering techniques that transform the variables into numbers that are no longer recognizable.**
- **Having said that, we are still free to perform feature engineering on static categorical variables, so let us experiment with some encoding techniques.**

### Categorical Variables Encoding

In [85]:
import pandas as pd
import category_encoders as ce
df = pd.read_csv(data_path)
df_raw = df.copy()



In [86]:
target_encoding_cols = ['location_country' , 'currency' , 'category', 'sub_category']

In [87]:
def calc_smoothning(df , by , on , m):
    mean =  df[on].mean()
    
    agg = df.groupby(by)[on].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']
    
    # Additive smoothing: blend each category mean with the global mean, weighted by m
    smooth = (counts * means + m * mean) / (counts + m)
    
    return df[by].map(smooth)
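
The smoothing helper above looks like additive (m-estimate) smoothing of a target mean, whose usual form is `(n * category_mean + m * global_mean) / (n + m)`. Assuming that is the intent, here is a small pure-Python worked example; the function name `smoothed_means` and the toy data are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def smoothed_means(categories, targets, m):
    """Return {category: smoothed target mean} using additive smoothing."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # sums[cat] equals n * category_mean, so this is the weighted blend
    return {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in counts
    }


# Toy data: "art" has four projects (3 successful), "tech" has one (successful)
categories = ["art", "art", "art", "art", "tech"]
targets = [1, 1, 1, 0, 1]

enc = smoothed_means(categories, targets, m=4)
print(enc)
```

With a global mean of 0.8 and `m=4`, the single "tech" project is pulled from a raw mean of 1.0 down to 0.84, while "art" moves only from 0.75 to 0.775. Rare categories are shrunk toward the global mean the most, which is the point of the `m` weight.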

In [88]:
cols = ['launched_at',  'status', 'days_to_deadline', 'goal',
       'sub_category', 'category', 'blurb_length', 'location_country',  'rewards_mean', 'rewards_median',
       'rewards_variance', 'rewards_SD', 'rewards_MIN', 'rewards_MAX' , 'rewards_NUM', 'currency', 'launch_year', 'launch_month', 'deadline_month']
df = df[cols]
df= df.dropna(axis=0, subset=["rewards_MIN"])
df= df.dropna(axis=0, subset=["blurb_length"])
df.sort_values("launched_at" , inplace=True)

df = df.reset_index(drop=True)


df["launched_at"]  = pd.to_datetime(df["launched_at"]).dt.date
df.sort_values("launched_at" , inplace=True)

df.drop(['launched_at'] ,axis=1 , inplace=True)
df.reset_index(inplace=True)
df.drop('index', inplace=True , axis=1)



from sklearn.preprocessing import LabelBinarizer

binarizer= LabelBinarizer()
df["status"] = binarizer.fit_transform(df["status"])






In [89]:
def helmert_encoding(X_train , X_valid ,target_encoding_cols ):
    encoder = ce.HelmertEncoder(cols = target_encoding_cols , drop_invariant=True )
    dfh = encoder.fit_transform(X_train[target_encoding_cols])
    X_train = pd.concat([X_train , dfh], axis=1)
    X_train.drop(target_encoding_cols , axis=1 , inplace=True)
    dfh = encoder.transform(X_valid[target_encoding_cols])
    X_valid = pd.concat([X_valid , dfh], axis=1)
    X_valid.drop(target_encoding_cols , axis=1 , inplace=True)
    _ , scores , _ = score(X_train , X_valid , y_train , y_valid)
    print("Helmert Encoding: "+str(scores))
    

In [90]:
def binary_encoding(X_train , X_valid ,target_encoding_cols ):
    encoder = ce.BinaryEncoder(cols = target_encoding_cols , drop_invariant=True )
    dfh = encoder.fit_transform(X_train[target_encoding_cols])
    X_train = pd.concat([X_train , dfh], axis=1)
    X_train.drop(target_encoding_cols , axis=1 , inplace=True)
    dfh = encoder.transform(X_valid[target_encoding_cols])
    X_valid = pd.concat([X_valid , dfh], axis=1)
    X_valid.drop(target_encoding_cols , axis=1 , inplace=True)
    _ , scores , _ = score(X_train , X_valid , y_train , y_valid)
    print("Binary Encoding: "+str(scores))
    

In [91]:
def cat_encoding(X_train , X_valid ,target_encoding_cols, y_train ):
    encoder = ce.CatBoostEncoder(cols = target_encoding_cols  )
    dfh = encoder.fit_transform(X_train[target_encoding_cols] , y_train)
    X_train = pd.concat([X_train , dfh], axis=1)
    X_train.drop(target_encoding_cols , axis=1 , inplace=True)
    dfh = encoder.transform(X_valid[target_encoding_cols])
    X_valid = pd.concat([X_valid , dfh], axis=1)
    X_valid.drop(target_encoding_cols , axis=1 , inplace=True)
    _ , scores , _ = score(X_train , X_valid , y_train , y_valid)
    print("CatBoostEncoder: "+str(scores))
    

In [92]:


train_years=[2015, 2016, 2017]
valid_years = [2018]
df_train = df[df['launch_year'].isin(train_years)]
df_valid = df[df['launch_year'].isin(valid_years)]
X_train , y_train = df_train.drop(["status","launch_year"] , axis=1) , df_train['status']
X_valid , y_valid = df_valid.drop(["status","launch_year"] , axis=1) , df_valid['status']



In [93]:
helmert_encoding(X_train , X_valid , target_encoding_cols)

Helmert Encoding: 0.8019584515420518


In [94]:
binary_encoding(X_train , X_valid ,target_encoding_cols )

Binary Encoding: 0.793264390958177


In [95]:
cat_encoding(X_train , X_valid ,target_encoding_cols , y_train)

CatBoostEncoder: 0.7433879381348952


### Handling Unknown Categories


- **One thing we need to take care of before the model goes to production is handling unknown categories in the categorical variables (we already handled outliers and missing values).**
- **Since the way we handle them affects the model, we need to settle it before hyperparameter optimization and model selection.**
- **As the categorical variables are encoded by the Helmert encoder, let's see how it handles them.**

In [31]:
import category_encoders as ce

encoder = ce.HelmertEncoder()


In [32]:
encoder.fit_transform(['statistics', 'datascience', 'machine learning'])

| | intercept | 0_0 | 0_1 |
|---|---|---|---|
| 0 | 1 | -1.0 | -1.0 |
| 1 | 1 | 1.0 | -1.0 |
| 2 | 1 | 0.0 | 2.0 |


In [33]:
encoder.transform(['deeplearning'])

| | intercept | 0_0 | 0_1 |
|---|---|---|---|
| 0 | 1 | 0.0 | 0.0 |


- **As we can see above, the encoder does not throw an error for an unseen category (it encodes it as all zeros), so it would not break in production.**
- **The probability of encountering a new category is low anyway, so it is not something to worry about.**
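
Since an unseen category is encoded silently rather than rejected, it may still be worth logging when one shows up. The sketch below is one possible pre-check, not something the notebook does; `unseen_categories` is a hypothetical helper and the values reuse the toy example above.

```python
def unseen_categories(train_values, new_values):
    """Return the sorted categories in new_values that never appeared in training."""
    return sorted(set(new_values) - set(train_values))


train_values = ["statistics", "datascience", "machine learning"]
new_values = ["datascience", "deeplearning"]

missing = unseen_categories(train_values, new_values)
if missing:
    # In production this could go to a log or a monitoring counter instead
    print("Unseen categories:", missing)
```

The encoders in `category_encoders` also expose a `handle_unknown` option that controls this behaviour; check the documentation of the installed version for the supported values.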

### Conclusions

- **Finally, we've achieved an acceptable score of 80% with Helmert encoding.**

- **But for interpretability reasons we will also have to include one-hot encoding (you will see why further down the post).**

- **There is plenty of feature engineering we could have done to boost the accuracy of the model, but it would transform our features so that they could no longer be interpreted, which does not align with our business objective.**



- **Until now we have only tried Random Forest; in the next notebook let us try XGBoost.**