## Business Understanding 
Purpose: Ask relevant questions and define objectives for the problem that needs to be tackled

## Background
In recent years, the range of funding options for projects created by individuals and small companies has expanded considerably. In addition to savings, bank loans, friends & family funding and other traditional options, crowdfunding has become a popular and readily available alternative.

Kickstarter, founded in 2009, is a particularly well-known crowdfunding platform. It uses an all-or-nothing funding model: a project is funded only if it reaches its goal amount; otherwise backers pay nothing. Many factors contribute to the success or failure of a project, both in general and on Kickstarter. Some of these can be quantified or categorized, which makes it possible to build a model that predicts whether a project will succeed. The aim of this project is to build such a model and to analyse Kickstarter project data more generally, to help potential project creators assess whether Kickstarter is a good funding option for them and what their chances of success are.

### Final Deliverables


* Well-designed presentation for non-technical stakeholders outlining findings, recommendations and future work (10-minute presentation).
* Jupyter notebook following the Data Science Lifecycle

### Things to think about

* Try at least three different machine learning algorithms to check which performs best on the problem at hand
* What would be the right performance metric: precision, recall, accuracy, F1 score, or something else? (Check TPR?)
* Check for data imbalance


## Key Question 

Kickstarter has tasked us with building a model that, as a first step, predicts whether a project is likely to be successful given certain project parameters. As a second step (out of scope), Kickstarter would like to provide a good goal recommendation for creators (for example, using staff picks).

* Given certain project parameters, __is a campaign likely to succeed or fail?__ --> classification
* What would be a __reasonable goal recommendation for creators__? --> regression



## Feature Glossary

Features included in model

* Target : state
*
*
*

## Dataset Description

- **backers_count**: Number of people who backed this project
- **category**: 
- **country**: Country the project owner lives in
- **created_at**: Date when the project was created
- **currency**: Currency of the country where the owner lives
- **currency_trailing_code**: 
- **current_currency**: 
- **deadline**: Date until the project can be backed
- **disable_communication**: Whether communication with the owner was disabled
- **fx_rate**: Foreign exchange rate
- **goal**: Funding goal; the project is only funded when this amount is reached
- **launched_at**: Date when the project was launched
- **spotlight**: Highlighted projects (available to all projects that are successfully funded)
- **staff_pick**: Promising project picked by Kickstarter employees
- **state**: Project status
- **state_changed_at**: Date when the state last changed
- **static_usd_rate**: Static USD conversion rate
- **usd_pledged**: Pledged amount converted to USD using `static_usd_rate`


## Dataset New/Added Feature Description

- **campaign_days**: Number of days the project was live (deadline minus launch date)
- **pledged_over**: Amount pledged in excess of the goal (using the converted pledge amount)
- **pre_launched_days**: Number of days between project creation and launch


## Target Metric

* F1 score: creators would not want the model to predict many successes that turn out to be failures (minimize false positives), and backers would want the model to capture as many successes as possible (minimize false negatives), so I want a balance between precision and recall.
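
As a rough illustration of how the F1 score balances the two error types, the following sketch computes precision, recall and F1 from confusion-matrix counts. The helper `f1_from_counts` and the counts are hypothetical and are not results from the Kickstarter data.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is the balance described above.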

## Outcome / Recommendations
*
*
*

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.gridspec import GridSpec
import scipy as sc
from scipy.stats import kstest
import seaborn as sns
import math
import warnings
warnings.filterwarnings("ignore")

#Data mining
import os, glob

#Preprocessing
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
import imblearn
from imblearn.over_sampling import RandomOverSampler





## Dashboard
Purpose : Define global variables and visuals

In [None]:
random_state = 100
test_size = 0.3
sns.set(style = "white")

## Data Mining

In [None]:
# Import multiple Kickstarter csv files and merge into one dataframe

path = "data-2"
all_files = glob.glob(os.path.join(path, "*.csv"))

all_df = []
for f in all_files:
    df = pd.read_csv(f, sep=',')
    df['file'] = os.path.basename(f)  # file name without the directory, on any OS
    all_df.append(df)
    
merged_df = pd.concat(all_df, ignore_index=True, sort=True)

In [None]:
# Needed if I don't want to run the merge step above
#merged_df = pd.read_csv('data-2/Kickstarter_all.csv')

## Inspection and Data Cleaning

In [None]:
merged_df.info()

In [None]:
#save the merged data as .zip
#compression_opts = dict(method='zip', archive_name='out.csv')  
#merged_df.to_csv('out.zip', index=False, compression=compression_opts)

In [None]:
# Display shape of "data"
merged_df.shape

In [None]:
merged_df.head()

In [None]:
merged_df.columns

In [None]:
merged_df.groupby('state').count()

In [None]:
pd.isnull(merged_df).sum()

## Data Handling

In [None]:
# create a dataset for Inspection
final = merged_df.copy()

### Dropping Data

In [None]:
drop_list = []

#### Dropping features with missing values

In [None]:
drop_missing_values = ['blurb', 'friends', 'is_backing', 'is_starred', 'permissions', 'usd_type', 'location']
drop_list.extend(drop_missing_values)
final = final.drop(drop_missing_values, axis = 1)


#### Dropping useless features 

In [None]:
drop_useless_features = ['creator', 'currency_symbol', 'name', 'photo', 'profile', 'slug', 'source_url', 'urls', 'file']
drop_list.extend(drop_useless_features)
final = final.drop(drop_useless_features, axis = 1)

#### Dropping redundant features

In [None]:
drop_redundant_features = ['pledged', 'usd_pledged']
drop_list.extend(drop_redundant_features)
final = final.drop(drop_redundant_features, axis = 1)

In [None]:
drop_list

#### Replacing features

In [None]:
def clean_category(DataFrame): 
    cat_list = []
    subcat_list = []
    for e in DataFrame.category:
        string_list = e.split(',')
        if '/' in string_list[2]:
            cat_list.append(string_list[2].split('/')[0][8:])
            subcat_list.append(string_list[2].split('/')[1][:-1])
        else:
            cat_list.append(string_list[2][8:-1])
            subcat_list.append('None')
    DataFrame['category'] = cat_list
    DataFrame['sub_category'] = subcat_list
    return DataFrame
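
The string slicing above appears to rely on each `category` entry being a JSON string whose third field is a `slug` such as `"fashion/footwear"`. Assuming that layout holds, a sketch that parses the JSON directly would be less sensitive to field order. The helper `split_category_slug` and the sample strings are hypothetical.

```python
import json

def split_category_slug(raw):
    """Return (category, sub_category) from a JSON string with a 'slug' key."""
    slug = json.loads(raw)["slug"]
    # partition splits on the first '/' only; sub is '' when there is none
    main, _, sub = slug.partition("/")
    return main, (sub if sub else "None")

# Hypothetical sample entries, for illustration only
samples = [
    '{"id": 1, "name": "Footwear", "slug": "fashion/footwear"}',
    '{"id": 2, "name": "Music", "slug": "music"}',
]
for raw in samples:
    print(split_category_slug(raw))
```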

In [None]:
modified_list = ['category','state']

In [None]:
final = clean_category(final)

In [None]:
final.category.unique()

In [None]:
# Replace 'successful' and 'failed' with 1 and 0
final['state'] = final['state'].replace(['successful', 'failed'], [1, 0])

final.is_starrable = final.is_starrable.astype(int)
final.disable_communication = final.disable_communication.astype(int)
final.currency_trailing_code = final.currency_trailing_code.astype(int)
final.staff_pick = final.staff_pick.astype(int)
final.spotlight = final.spotlight.astype(int)
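# Drop live, suspended and cancelled projects (keep only successful/failed)
final = final[final['state'].isin([1, 0])].copy()
final['state'] = final['state'].astype(int)  # column is object-typed until the other states are removed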

### Time conversions



In [None]:
modified_list.extend(['launched_at', 'deadline', 'created_at', 'state_changed_at'])         

In [None]:
# Convert Unix timestamps (seconds) to datetime
final.launched_at = pd.to_datetime(final.launched_at, unit='s')
final.deadline = pd.to_datetime(final.deadline, unit='s')
final.created_at = pd.to_datetime(final.created_at, unit='s')
final.state_changed_at = pd.to_datetime(final.state_changed_at, unit='s')
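
To make the conversion concrete, the sketch below applies `pd.to_datetime(..., unit='s')` to two hypothetical Unix timestamps and derives a campaign length in whole days. The timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds, for illustration only
toy = pd.DataFrame({
    "launched_at": [1609459200],   # 2021-01-01 00:00:00 UTC
    "deadline": [1612051200],      # 30 days later
})
for col in ["launched_at", "deadline"]:
    toy[col] = pd.to_datetime(toy[col], unit="s")

# Subtracting datetimes gives a Timedelta; .dt.days extracts whole days
toy["campaign_days"] = (toy["deadline"] - toy["launched_at"]).dt.days
print(toy)
```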

### Writing df changes

In [None]:
feature_list = list(merged_df.columns)

df_features = pd.DataFrame(feature_list,columns =['features'])
df_features['dropped'] = df_features.features.isin(drop_list)
df_features['drop_reason'] = ['missing_values' if x in drop_missing_values \
                              else 'useless' if x in drop_useless_features \
                              else 'redundant' if x in drop_redundant_features \
                              else 'None' for x in df_features['features']]
df_features['modified'] = df_features.features.isin(modified_list)

In [None]:
df_features

# Data Exploration
Purpose: Form hypotheses and a story about the defined problem by visually analysing the data

In [None]:
#new dataset for exploration
data_exp = final.copy()

In [None]:
#years
#final['launched_at_yr'] = [date.year for date in final['launched_at']]

In [None]:
final.info()

In [None]:
# Separate continuous vs. categorical variables
data_cat_col = ['category','country','sub_category','currency','current_currency','is_starrable','disable_communication','state']
data_cont_col = [x for x in final if x not in data_cat_col]
data_cat = final[data_cat_col]
data_cont = final[data_cont_col]

In [None]:
# Check whether scaling is needed (we can do this by looking at .skew())
final.skew(numeric_only=True)

In [None]:
# Plot correlation heatmap for continuous values
mask = np.triu(np.ones_like(data_cont.corr(), dtype=bool))  # np.bool was removed from NumPy
f, ax = plt.subplots(figsize=(11, 9))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(data_cont.corr(), mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt=".1g");


In [None]:
#Plot a histogram of our Target 'state' and see if it needs scaling for future work
data_exp['state'].value_counts(ascending=True).plot(kind='bar')

* The target classes are imbalanced.

In [None]:
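# launched_at_yr does not exist in data_exp yet, so derive it here
data_exp['launched_at_yr'] = data_exp['launched_at'].dt.year
plt.figure(figsize=(14,8))
sns.countplot(x='launched_at_yr', hue='state', data=data_exp);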

In [None]:
plt.figure(figsize=(14,10))
sns.countplot(x='category', hue='state', data=data_exp);

# Feature Engineering

In [None]:
# Create new features
final['pledged_over'] = final.converted_pledged_amount - final.goal
# .dt.days converts the Timedelta results to whole days
final['campaign_days'] = (final.deadline - final.launched_at).dt.days
final['pre_launched_days'] = (final.launched_at - final.created_at).dt.days

final['launched_at_yr'] = [date.year for date in final['launched_at']]

final["goal_converted"] = final["goal"] * final["static_usd_rate"]
#use log on goal_converted

In [None]:
# Log-transform skewed features
final['goal_converted_log'] = np.log(final['goal_converted'])
# log1p(x) = log(1 + x) avoids -inf if a project has zero pledges or backers
final['converted_pledged_amount_log'] = np.log1p(final['converted_pledged_amount'])
final['backers_count_log'] = np.log1p(final['backers_count'])
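
The effect of a log transform on a right-skewed feature can be checked with a small sketch. The values below are synthetic, drawn from a log-normal distribution purely for illustration, and `np.log1p` is used so that zeros would not produce `-inf`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)
# Synthetic right-skewed values (not Kickstarter data)
goal = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=5000))

skew_raw = goal.skew()
skew_log = np.log1p(goal).skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skew close to zero after the transform suggests the feature is roughly symmetric on the log scale.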

# Preprocessing (Train/Test Split and Basemodel)
To train models on different versions of the dataset (for example, the original and the oversampled training data), the train/test split should be done once, in a reproducible way.



In [None]:
#define predictors and target variable X,y
X = final.drop(["state"], axis=1)
y = final["state"]

In [None]:
#Split data into training and testing sets
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=test_size,
                                                        random_state=random_state,
                                                        shuffle=True,
                                                   stratify=y)

In [None]:
X_train.info()

In [None]:
# Create a dummy classifier as the base model (always predicts class 0)
dummy_clf = DummyClassifier(strategy='constant', constant=0).fit(X_train, y_train)
y_pred_dum_clf = dummy_clf.predict(X_test)  # fixed: the variable is dummy_clf, not dum_clf
print(dummy_clf.score(X_test, y_test))
print(confusion_matrix(y_test, y_pred_dum_clf))
print(classification_report(y_test, y_pred_dum_clf))


In [None]:
#for future work
#scores = cross_val_score(dummy_clf, X_train, y_train, scoring='f1', cv=10, n_jobs=-1)

In [None]:
#use oversampling

# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority', random_state=random_state)
# fit and apply the transform
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)
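
Conceptually, random oversampling duplicates randomly chosen minority-class rows until both classes are the same size. The sketch below reproduces that idea with plain pandas on a synthetic frame; it is not the `imblearn` implementation, and the toy columns are hypothetical.

```python
import pandas as pd

# Synthetic, imbalanced toy data (not Kickstarter data)
toy = pd.DataFrame({
    "goal": range(10),
    "state": [1] * 7 + [0] * 3,
})

counts = toy["state"].value_counts()
minority = counts.idxmin()
n_extra = int(counts.max() - counts.min())

# Sample minority rows with replacement and append them
extra = toy[toy["state"] == minority].sample(n=n_extra, replace=True, random_state=100)
balanced = pd.concat([toy, extra], ignore_index=True)
print(balanced["state"].value_counts())
```

Oversampling is applied to the training split only, so that duplicated rows cannot leak into the test set.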



In [None]:
X_train.info()

In [None]:
# Apply the standard scaler to X_train and X_test
scaler = StandardScaler()
# StandardScaler expects numeric input, so skip the datetime columns
num_cols = X_train[data_cont_col].select_dtypes(include='number').columns
X_train[num_cols] = scaler.fit_transform(X_train[num_cols]) # Scaler is fitted to training data _only_
X_test[num_cols] = scaler.transform(X_test[num_cols]) # Already fitted scaler is applied to test data


#data_cat_col = ['category','country','sub_category','country','currency','current_currency','is_starrable','disable_communication']
#data_cont_col = [x for x in final if x not in data_cat_col]

In [None]:
#use standard scaler on X_train_over and y_train_over

# Predictive Modelling
Purpose: Train machine learning models (supervised), evaluate their performance and use them to make predictions

In [None]:
#logistic regression

In [None]:
#Random Forest Classifier

In [None]:
#Support Vector Machines (use classifier)

In [None]:
#maybe AdaBoost
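
One possible way to compare the candidate classifiers listed above is cross-validated F1 on the training data. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the prepared Kickstarter features; the model settings are assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the numeric, scaled training data
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   weights=[0.4, 0.6], random_state=100)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=100),
    "svc": SVC(),
    "adaboost": AdaBoostClassifier(random_state=100),
}

# Mean F1 over 5 folds for each candidate model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, scoring="f1", cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {results[name]:.3f}")
```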

# Ensemble Methods

In [None]:
#use KNN,SVC,DTC,Randomforestclassifier,XGB....

# Future Work

In [None]:
#use maybe RandomizedSearchCV on RandomForest or any given Algorithm
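
A minimal sketch of what such a search could look like for a random forest, again on synthetic data. The search space below is hypothetical and only meant to show the mechanics of `RandomizedSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=100)

# Hypothetical search space, for illustration only
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=100),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    scoring="f1",      # same metric as the target metric above
    cv=3,
    random_state=100,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```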

# Data Visualisation
Purpose: Communicate the findings with stakeholders using plots and interactive visualisations

# Findings 
Purpose: Summarize the key results and findings

# Future Work

In [None]:
# To do: save final df as csv
#compression_opts = dict(method='zip', archive_name='Kickstarter_all_clean.csv')  
#final.to_csv('Kickstarter_all_clean.zip', index=False, compression=compression_opts)