
Kaggle-Kickstarter-Project-Status-Prediction

The aim of this project is to predict the state of Kickstarter projects ('Successful' or 'Failed') before their actual deadlines.

Dataset:

The dataset provided to us covers projects from 2009 through 2017. It is at a project-ID level and has 331,675 rows (331,675 projects).

Understanding Variables in the Dataset

The dataset has 15 variables, including ID. Since ID is the level of the dataset, we can set it as the index of the data later. Variables like name, currency, deadline, launched date and country are self-explanatory. Explanations of some key variables are as follows:

Main_Category: There are 15 main categories for the projects. These main categories broadly classify projects based on the topic and genre they belong to.

Category: Main categories are further subdivided into categories that give a more specific idea of the project. For example, the main category “Technology” has 15 categories such as Gadgets, Web, Apps, Software, etc. There are 159 categories in total.

Goal: This is the goal amount that the company needs to raise to start its project. The goal amount is an important variable for the company: if it is too high, the project may fail to raise that amount of money and be unsuccessful; if it is too low, the project may reach its goal quickly and backers may not be interested in pledging more.

Pledged: This is the amount raised by the company through its backers. On Kickstarter, if the total amount pledged is lower than the goal, the project is unsuccessful and the start-up company doesn't receive any funds. If the pledged amount is at or above the goal, the project is considered successful. The variable “usd pledged” is the amount of money raised in US dollars.

Number of Backers: This is the number of people who have supported the project by pledging some amount.
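For concreteness, here is a minimal pandas sketch of loading the data, setting ID as the index and deriving a binary target from the project state. The file name and the exact column names ('ID', 'state') are assumptions about the Kaggle schema, not taken from the notebooks.

```python
import pandas as pd

# Load the Kaggle Kickstarter dump (file name is an assumption).
df = pd.read_csv("ks-projects.csv")

# ID is the level of the dataset, so it can serve as the index.
df = df.set_index("ID")

# Keep only finished projects and encode the target:
# 'successful' means the pledged amount reached the goal.
df = df[df["state"].isin(["successful", "failed"])]
df["target"] = (df["state"] == "successful").astype(int)
```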

What are the steps followed?

  1. EDA and data understanding
  2. Feature engineering and manual (heuristic) feature selection
  3. Model Building and Predictions
  4. Feature importance evaluation - to get the key drivers

Models tested

  1. Logistic Regression with Grid Search
  2. XGBoost
  3. Random Forest
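A minimal sketch of how these three candidates could be trained and compared. Here `X` and `y` stand for the engineered feature matrix and the binary target, and the hyperparameter values are illustrative rather than the ones actually used in the notebooks.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# X, y: engineered feature matrix and binary target (assumed to exist).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Logistic regression tuned with grid search over the regularisation strength.
logit = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
    scoring="accuracy",
)

models = {
    "logistic_regression": logit,
    "xgboost": XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```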

Best model

Best accuracy:

  First iteration: 68.9%

  Second iteration: 69.2%

  Third iteration: 70.3%

Best model: XGBoost (first and second iterations), LightGBM (third iteration).

Key Drivers (top 15):

The list below shows the top 15 features and their corresponding importance from XGBoost (first iteration). Notes in parentheses describe the engineered features.

  1. duration - 0.111218 (difference between the deadline and the launch date)
  2. participants - 0.091028 (number of projects launched in the same year-quarter, with the same goal bucket, in the same category)
  3. avg_success_rate - 0.084386 (probability of success of a project, based on the pledge per backer and goal amount of similar projects in the project year)
  4. launched_month - 0.075908
  5. avg_ppb - 0.070271 (average pledge per backer of similar projects in the same category in the given year)
  6. launched_quarter - 0.063191
  7. goal - 0.060700
  8. usd_goal_real - 0.056942
  9. launched_year - 0.044968
  10. goal_cat_perc - 0.038937 (percentile bucket of the goal)
  11. currency_USD - 0.035004
  12. currency_GBP - 0.009265
  13. main_category_Film_and_Video - 0.007648
  14. country_US - 0.007604
  15. main_category_Technology - 0.005463
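A rough sketch of how several of these engineered features could be recreated in pandas. The decile-sized goal bucket and the exact grouping keys are assumptions based on the descriptions above; `df` is the dataset loaded earlier.

```python
import pandas as pd

df["launched"] = pd.to_datetime(df["launched"])
df["deadline"] = pd.to_datetime(df["deadline"])

# duration: days between the launch date and the deadline.
df["duration"] = (df["deadline"] - df["launched"]).dt.days

# Calendar features of the launch date.
df["launched_year"] = df["launched"].dt.year
df["launched_quarter"] = df["launched"].dt.quarter
df["launched_month"] = df["launched"].dt.month

# goal_cat_perc: percentile of the goal within its category.
df["goal_cat_perc"] = df.groupby("category")["goal"].rank(pct=True)

# participants: number of projects launched in the same year-quarter,
# category and goal bucket (bucket size is an assumption).
df["goal_bucket"] = pd.qcut(df["goal"], 10, labels=False, duplicates="drop")
df["participants"] = df.groupby(
    ["launched_year", "launched_quarter", "category", "goal_bucket"]
)["goal"].transform("size")
```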

The list below shows the top 15 features and their corresponding importance from XGBoost (second iteration).

  1. duration - 0.089933
  2. name_len - 0.067842 (length of the project name in characters)
  3. participants - 0.067755
  4. name_words - 0.066489 (number of words in the project name)
  5. avg_success_rate - 0.064787
  6. launched_month - 0.056317
  7. avg_ppb - 0.054222
  8. launched_quarter - 0.046669
  9. usd_goal_real - 0.036803
  10. goal - 0.033310
  11. goal_log - 0.029468 (natural log of the goal, to shrink its scale)
  12. launched_year - 0.025452
  13. currency_USD - 0.024884
  14. goal_cat_perc - 0.023880
  15. Goal_1000 - 0.018772 (goal divided by 1000, to classify the ranges)
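The name- and goal-based features added in this iteration can be sketched as follows (assuming the raw 'name' and 'goal' columns of the dataset loaded above):

```python
import numpy as np

# Length of the project name in characters and in words.
df["name_len"] = df["name"].fillna("").str.len()
df["name_words"] = df["name"].fillna("").str.split().str.len()

# Natural log of the goal to shrink its scale (log1p handles goal == 0).
df["goal_log"] = np.log1p(df["goal"])

# Goal expressed in thousands, to classify the ranges.
df["Goal_1000"] = df["goal"] / 1000
```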

The list below shows the top 15 features and their corresponding importance from LightGBM (third iteration).

  1. category - 554198.777930
  2. avg_success_rate_goal - 524201.986343
  3. duration - 227475.493064
  4. usd_goal_real - 183247.092781
  5. avg_success_rate_duration - 79535.630888
  6. name_words - 79464.149268
  7. launched_year - 70389.641850
  8. name_len - 59306.375015
  9. goal - 49782.070504
  10. country - 44977.135745
  11. main_category - 44118.396859
  12. participants_qtr - 31245.759655
  13. avg_ppb_goal - 27836.024058
  14. launched_week - 22932.222815
  15. mean_goal_year - 16445.190139
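For reference, importance values on this scale can be pulled from a fitted LightGBM model roughly like this. `lgbm_model` is an assumed, already-fitted LGBMClassifier, and 'gain' importance is an assumption about which importance type produced the numbers above.

```python
import pandas as pd

# 'gain' importance: total gain of the splits using each feature, which is
# why the values are on a very different scale from the normalised
# XGBoost importances of the earlier iterations.
booster = lgbm_model.booster_
importance = pd.Series(
    booster.feature_importance(importance_type="gain"),
    index=booster.feature_name(),
).sort_values(ascending=False)

print(importance.head(15))
```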

Contents of the repository

The file 'kickstarter_project_predictions_final_version_0109.ipynb' contains the first iteration, with a best accuracy of 68.9%.

The file 'Kernel.ipynb' contains the second iteration, with a best accuracy of 69.2%.

The file 'kickstarter_final_run_lgbm703.ipynb' contains the third iteration, with a best accuracy of 70.3% using the LightGBM model.

What are the additional steps followed in the second iteration?

I created additional features using the name of the project and the goal amount; these features are explained in the notebook itself. Although I also tried ensembling models in the second iteration, using a simple averaging approach and boosting (AdaBoost), there was no improvement in performance.
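The simple averaging ensemble mentioned above amounts to averaging the predicted class probabilities of the fitted models. A minimal sketch, with the model names (`logit`, `xgb_model`, `rf_model`) and the held-out split assumed:

```python
import numpy as np

# Average the class-1 probabilities of the already-fitted models.
avg_proba = np.mean(
    [m.predict_proba(X_test)[:, 1] for m in (logit, xgb_model, rf_model)],
    axis=0,
)
ensemble_pred = (avg_proba >= 0.5).astype(int)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```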

What are the additional steps followed in the third iteration?

Additional features were created around the duration and the number of participants. These features only improved the accuracy by 0.3%, to 69.5%.

I tried to execute a randomized search on XGBoost but stopped it due to the run time. I tried tweaking parameters (learning_rate, n_estimators and max_depth) manually, but it did not make any change to the model performance.
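For reference, a randomized search over those three parameters could be set up as below. The distributions and n_iter are illustrative, and the fit call is commented out because, as noted above, the search was stopped due to run time.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions={
        "learning_rate": uniform(0.01, 0.2),
        "n_estimators": randint(100, 600),
        "max_depth": randint(3, 10),
    },
    n_iter=10,          # kept small; the full search was too slow
    scoring="accuracy",
    cv=3,
    random_state=42,
)
# search.fit(X_train, y_train)  # stopped in practice due to run time
```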

I then ran LightGBM in two different ways:

A. LGBM with one-hot encoded categorical features

B. LGBM with the categorical features (category, main_category, currency, country) converted to integer-coded category-type columns and fed to the LGBM model using the 'categorical_feature' argument of the 'fit' function

This brings the current accuracy to 70.3%.
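A sketch of the two set-ups. Option B passes `categorical_feature` to `fit` as described above; depending on the LightGBM version this argument may be deprecated in favour of pandas 'category' dtypes, which the model then picks up automatically.

```python
import pandas as pd
from lightgbm import LGBMClassifier

# X, y: feature matrix and binary target (assumed to exist).
cat_cols = ["category", "main_category", "currency", "country"]

# A. One-hot encoded categorical features.
X_onehot = pd.get_dummies(X, columns=cat_cols)
lgbm_a = LGBMClassifier(n_estimators=500)
lgbm_a.fit(X_onehot, y)

# B. Categorical columns converted to integer-coded 'category' dtype and
#    passed to LightGBM through the categorical_feature argument.
X_cat = X.copy()
for col in cat_cols:
    X_cat[col] = X_cat[col].astype("category")
lgbm_b = LGBMClassifier(n_estimators=500)
lgbm_b.fit(X_cat, y, categorical_feature=cat_cols)
```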

Links

The Kaggle kernel for the first iteration can be found here

The Kaggle kernel for the second iteration is here

The Kaggle kernel for the third iteration is here
