# Final Project
# Kickstarter: Predicting the success/failure of crowdfunded projects
Ryan Ly
*****

### Data Cleaning
For the first part of this project, the data is imported, concatenated into one dataframe (because the scraped data consists of 56 separate csv files), and cleaned in preparation for exploratory data analysis and modeling. Because the data requires a significant amount of cleaning, it was easier to process it with Pandas first and then use the cleaned dataset to train the machine learning models with sklearn and pyspark later.

In [None]:
# Import all required packages
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50) # Display up to 50 columns at a time
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import cm
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12,5
import seaborn as sns
sns.set()
sns.set_palette("husl")

import glob # To read all csv files in the directory

import calendar
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_fscore_support
import itertools
import time
import xgboost as xgb

In [None]:
# Combine all Kickstarter csvs
df = pd.concat([pd.read_csv(f) for f in glob.glob('data/Kickstarter*.csv')], ignore_index = True)
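
If the glob pattern matches no files, `pd.concat` fails with an error that does not mention the missing files. A minimal sketch of a more defensive loader follows; the name `load_csvs` is hypothetical, and sorting the paths is only an assumption made here so that the row order is reproducible.

```python
import glob

import pandas as pd


def load_csvs(pattern):
    """Read every csv matching the glob pattern into one dataframe (hypothetical helper)."""
    paths = sorted(glob.glob(pattern))  # sorted so the row order is reproducible
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

It could be called as `df = load_csvs('data/Kickstarter*.csv')`.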

In [None]:
df.head()

In [None]:
# Total number of projects
len(df)

In [None]:
# Checking the columns (features)
df.columns

Column descriptions:
- backers_count = number of people who contributed funds to the project
- blurb = short description of the project
- category = contains the category and sub-category of the project
- converted_pledged_amount = amount of money pledged, converted to the currency in the 'current_currency' column
- country = country the project creator is from
- created_at = date and time of when the project was initially created on Kickstarter
- creator = name of the project creator and other information about them, e.g. Kickstarter id number
- currency = original currency the project goal was denominated in
- currency_symbol = symbol of the original currency the project goal was denominated in
- currency_trailing_code = code of the original currency the project goal was denominated in
- current_currency = currency the project goal was converted to
- deadline = date and time of when the project will close for donations
- disable_communication = whether or not a project owner disabled communication with their backers
- friends = unclear (null or empty)
- fx_rate = foreign exchange rate between the original currency and the current_currency
- goal = funding goal
- id = id number of the project
- is_backing = unclear (null or false)
- is_starrable = whether or not a project can be starred (liked and saved) by users
- is_starred = whether or not a project has been starred (liked and saved) by users
- launched_at = date and time of when the project was launched for funding
- location = contains the town or city of the project creator
- name = name of the project
- permissions = unclear (null or empty)
- photo = contains a link and information to the project's photo/s
- pledged = amount pledged in the current_currency
- profile = details about the project's profile, including id number and various visual settings
- slug = name of the project with hyphens instead of spaces
- source_url = url for the project's category
- spotlight = after a project has been successful, it is spotlighted on the Kickstarter website
- staff_pick = whether a project was highlighted as a staff_pick when it was launched/live
- state = whether a project was successful, failed, canceled, suspended or still live
- state_changed_at = date and time of when a project's status was changed (same as the deadline for successful and failed projects)
- static_usd_rate = conversion rate between the original currency and USD
- urls = url to the project's page
- usd_pledged = amount pledged in USD
- usd_type = domestic or international

In [None]:
# Checking for number of duplicates of individual projects (will be addressed later)
len(df[df.duplicated(subset='id')])

In [None]:
# Checking column information
df.info()

In [None]:
# Dropping columns that are mostly null
df.drop(['friends', 'is_backing', 'is_starred', 'permissions'], axis=1, inplace=True)

Columns that aren't useful:

- converted_pledged_amount = most currencies are converted into USD in this column, but not all. Instead, the 'usd_pledged' column will be used as these all use the same currency (the dollar).
- creator = most projects are by different people, and so this cannot be usefully used to group or categorise projects, and is not useful in a machine learning context.
- currency = all currency values will be used as/converted to dollars, so that they can be evaluated together. It is not necessary to keep the original record because of this, and because it will be highly correlated with country (which will be kept).
- currency_symbol = same as above.
- currency_trailing_code = same as above.
- current_currency = same as above.
- fx_rate = this is used to create 'converted_pledged_amount' from 'pledged', but does not always convert to dollars so can be dropped in favour of 'static_usd_rate' which always converts to dollars.
- photo = image processing/computer vision will not be used in this project.
- pledged = data in this column is stored in native currencies, so this will be dropped in favour of 'usd_pledged' which is all in the same currency (dollars).
- profile = this column contains a combination of information from other columns (e.g. id, state, dates, url).
- slug = this is simply the 'name' column with hyphens instead of spaces.
- source_url = the sites that the rows were each scraped from are not useful for building a model, as each is unique to an id.
- spotlight = projects can only be spotlighted after they are already successful, so this will be entirely correlated with successful projects.
- state_changed_at = this is the same as deadline for most projects. The only exceptions are for projects which were cancelled before their deadline, but they will not be included in this analysis.
- urls = as with source_url.
- usd_type = it is unclear what this column means, but it is unlikely to be necessary since all currency values will be converted to dollars, and other currency information has been dropped.

In [None]:
# Dropping columns that aren't useful
df.drop(['converted_pledged_amount', 'creator', 'currency', 'currency_symbol', 'currency_trailing_code', 'current_currency', 'fx_rate', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'state_changed_at', 'urls', 'usd_type'], axis=1, inplace=True)

In [None]:
# Converting dates from unix to datetime
cols_to_convert = ['created_at', 'deadline', 'launched_at']
for c in cols_to_convert:
    df[c] = pd.to_datetime(df[c], origin='unix', unit='s')
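
As a quick sanity check on the conversion: a Unix timestamp counts seconds since 1 January 1970 (UTC). The sketch below uses made-up values, not rows from the dataset.

```python
import pandas as pd

# Illustrative timestamps only (seconds since the Unix epoch)
ts = pd.Series([0, 1546300800])
converted = pd.to_datetime(ts, origin='unix', unit='s')
print(converted)
```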

In [None]:
# Earliest listed project date
min(df.created_at).strftime('%d %B %Y')

In [None]:
# Latest listed project date
max(df.created_at).strftime('%d %B %Y')

For the time being, natural language processing will be deferred unless time permits. However, the length of the blurbs and names will be considered.

In [None]:
# Count length of each blurb
df['blurb_length'] = df['blurb'].str.split().str.len()

# Drop blurb variable
df.drop('blurb', axis=1, inplace=True)

In [None]:
# View category syntax
df.iloc[0]['category']

In [None]:
# Extracting the relevant sub-category section from the string
f = lambda x: x['category'].split('/')[1].split('","position')[0]
df['sub_category'] = df.apply(f, axis=1)

# Extracting the relevant category section from the string, and replacing the original category variable
f = lambda x: x['category'].split('"slug":"')[1].split('/')[0]
df['category'] = df.apply(f, axis=1)
f = lambda x: x['category'].split('","position"')[0] # Some categories do not have a sub-category, so do not have a '/' to split with
df['category'] = df.apply(f, axis=1)
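
The string splitting above depends on the exact order of the keys in the category string. A hedged alternative sketch, assuming each `category` value is a valid JSON object with a `slug` key of the form `category/sub_category`, parses the string instead. The helper name `parse_category` is hypothetical, and this sketch returns `None` for the sub-category when the slug has no `/`.

```python
import json


def parse_category(raw):
    """Return (category, sub_category) from a JSON category string (hypothetical helper)."""
    slug = json.loads(raw)['slug']
    category, _, sub_category = slug.partition('/')
    return category, (sub_category or None)  # None when there is no sub-category
```

Applied to the dataframe, this could look like `df['category'].apply(parse_category)`, which yields a Series of tuples.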

In [None]:
# Counting the number of unique categories
df.category.nunique()

In [None]:
# Counting the number of unique sub categories
df.sub_category.nunique()

In [None]:
# Checking the proportion of projects with communication disabled
df.disable_communication.value_counts(normalize=True)

In [None]:
# Drop disable communication column (because 99.7% are false)
df.drop('disable_communication', axis=1, inplace=True)

In [None]:
# Calculate new column 'usd_goal' as goal * static_usd_rate
df['usd_goal'] = round(df['goal'] * df['static_usd_rate'],2)

In [None]:
# Dropping goal and static_usd_rate
df.drop(['goal', 'static_usd_rate'], axis=1, inplace=True)

In [None]:
# Checking the proportions of is_starrable values to decide whether the column is worth keeping
df.is_starrable.value_counts(normalize=True)

In [None]:
# View location syntax
df.iloc[0]['location']

In [None]:
# Counting the number of unique locations
df.location.nunique()

In [None]:
# Dropping location (too many unique locations)
df.drop('location', axis=1, inplace=True)

In [None]:
# Count length of each name
df['name_length'] = df['name'].str.split().str.len()
# Drop name variable
df.drop('name', axis=1, inplace=True)

In [None]:
df['usd_pledged'] = round(df['usd_pledged'],2)

### Feature Engineering
After dropping columns that were irrelevant and/or not useful for this project, additional features are engineered from existing features:
- time from creation to launch
- campaign length
- launch day of week
- deadline day of week
- launch month
- deadline month
- launch time of day
- deadline time of day
- mean pledge per backer

In [None]:
# Time between creating and launching a project
df['creation_to_launch_days'] = df['launched_at'] - df['created_at']
df['creation_to_launch_days'] = df['creation_to_launch_days'].dt.round('d').dt.days # Rounding to nearest days, then showing as number only
# Or could show as number of hours:
# df['creation_to_launch_hours'] = df['launched_at'] - df['created_at']
# df['creation_to_launch_hours'] = df['creation_to_launch_hours'].dt.round('h') / np.timedelta64(1, 'h') 

# Campaign length
df['campaign_days'] = df['deadline'] - df['launched_at']
df['campaign_days'] = df['campaign_days'].dt.round('d').dt.days # Rounding to nearest days, then showing as number only

# Launch day of week
df['launch_day'] = df['launched_at'].dt.day_name() # .dt.weekday_name was removed in pandas 1.0

# Deadline day of week
df['deadline_day'] = df['deadline'].dt.day_name()

# Launch month
df['launch_month'] = df['launched_at'].dt.month_name()

# Deadline month
df['deadline_month'] = df['deadline'].dt.month_name()

In [None]:
# Launch time
df['launch_hour'] = df['launched_at'].dt.hour # Extracting hour from launched_at

def two_hour_launch(row):
    '''Creates two hour bins from the launch_hour column'''
    if row['launch_hour'] in (0,1):
        return '12am-2am'
    if row['launch_hour'] in (2,3):
        return '2am-4am'
    if row['launch_hour'] in (4,5):
        return '4am-6am'
    if row['launch_hour'] in (6,7):
        return '6am-8am'
    if row['launch_hour'] in (8,9):
        return '8am-10am'
    if row['launch_hour'] in (10,11):
        return '10am-12pm'
    if row['launch_hour'] in (12,13):
        return '12pm-2pm'
    if row['launch_hour'] in (14,15):
        return '2pm-4pm'
    if row['launch_hour'] in (16,17):
        return '4pm-6pm'
    if row['launch_hour'] in (18,19):
        return '6pm-8pm'
    if row['launch_hour'] in (20,21):
        return '8pm-10pm'
    if row['launch_hour'] in (22,23):
        return '10pm-12am'
    
df['launch_time'] = df.apply(two_hour_launch, axis=1) # Assigns each project to a two-hour bin based on launch_hour

df.drop('launch_hour', axis=1, inplace=True)

In [None]:
# Deadline time
df['deadline_hour'] = df['deadline'].dt.hour # Extracting hour from deadline

def two_hour_deadline(row):
    '''Creates two hour bins from the deadline_hour column'''
    if row['deadline_hour'] in (0,1):
        return '12am-2am'
    if row['deadline_hour'] in (2,3):
        return '2am-4am'
    if row['deadline_hour'] in (4,5):
        return '4am-6am'
    if row['deadline_hour'] in (6,7):
        return '6am-8am'
    if row['deadline_hour'] in (8,9):
        return '8am-10am'
    if row['deadline_hour'] in (10,11):
        return '10am-12pm'
    if row['deadline_hour'] in (12,13):
        return '12pm-2pm'
    if row['deadline_hour'] in (14,15):
        return '2pm-4pm'
    if row['deadline_hour'] in (16,17):
        return '4pm-6pm'
    if row['deadline_hour'] in (18,19):
        return '6pm-8pm'
    if row['deadline_hour'] in (20,21):
        return '8pm-10pm'
    if row['deadline_hour'] in (22,23):
        return '10pm-12am'
    
df['deadline_time'] = df.apply(two_hour_deadline, axis=1) # Assigns each project to a two-hour bin based on deadline_hour

df.drop('deadline_hour', axis=1, inplace=True)
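
The two binning functions above differ only in the column they read. As a sketch of a vectorized alternative, the hour can be integer-divided by two to get a bin index and then mapped to the same labels. The names `TWO_HOUR_LABELS` and `two_hour_bin` are hypothetical.

```python
import pandas as pd

TWO_HOUR_LABELS = ['12am-2am', '2am-4am', '4am-6am', '6am-8am', '8am-10am', '10am-12pm',
                   '12pm-2pm', '2pm-4pm', '4pm-6pm', '6pm-8pm', '8pm-10pm', '10pm-12am']


def two_hour_bin(timestamps):
    """Map a datetime Series to two-hour bin labels (hypothetical helper)."""
    bin_index = timestamps.dt.hour // 2  # hours 0-23 become bin indices 0-11
    return bin_index.map(dict(enumerate(TWO_HOUR_LABELS)))
```

With this helper, both columns could be produced as `two_hour_bin(df['launched_at'])` and `two_hour_bin(df['deadline'])`, without the temporary hour columns.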

In [None]:
# Mean pledge per backer
df['pledge_per_backer'] = round(df['usd_pledged']/df['backers_count'],2)
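
If any project has zero backers, the division above yields `NaN` (for 0/0) or `inf` (for a non-zero pledge divided by 0) rather than raising an error. The sketch below shows this on made-up values and one possible way to handle it. Filling with 0 is an assumption made for illustration, not a choice taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values: a normal project, one with no pledges or backers, and an edge case
toy = pd.DataFrame({'usd_pledged': [100.0, 0.0, 5.0], 'backers_count': [4, 0, 0]})

ratio = toy['usd_pledged'] / toy['backers_count']  # second value is NaN, third is inf
safe = ratio.replace([np.inf, -np.inf], np.nan).fillna(0).round(2)
print(safe)
```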

In [None]:
# Checking for null values
df.isna().sum()

In [None]:
# Replacing null values for blurb_length with 0
df.blurb_length.fillna(0, inplace=True)

In [None]:
# Confirming there are no null values remaining
df.isna().sum().sum()

In [None]:
# Number of projects of different states
df.state.value_counts()

In [None]:
# Dropping projects which are not successes or failures
df = df[df['state'].isin(['successful', 'failed'])]

In [None]:
# Confirming that the most recent deadline is the day on which the data was scraped, i.e. there are no projects which have yet to be resolved into either successes or failures
max(df.deadline)

In [None]:
# Collecting the duplicated rows of individual projects (by id)
duplicates = df[df.duplicated(subset='id')]

In [None]:
# Dropping duplicates which have every value in common
df.drop_duplicates(inplace=True)

In [None]:
# View leftover duplicate rows
duplicated = df[df.duplicated(subset='id', keep=False)].sort_values(by='id')
duplicated.head()

In [None]:
# Get list of index numbers for duplicated ids
dup_ids = duplicated.id.unique()
for i in dup_ids:
    index1 = duplicated[duplicated.id == i][:1].index.values
    index2 = duplicated[duplicated.id == i][1:2].index.values
    print(index1, index2)
    #print(duplicated.loc[index1] == duplicated.loc[index2]) # produces TypeError: Could not compare [None] with block values

In [None]:
df.loc[31239] == df.loc[66149]

In [None]:
df.loc[31239,['usd_goal','usd_pledged']]

In [None]:
df.loc[66149,['usd_goal','usd_pledged']]

For each pair of duplicates, there are small differences in the usd_pledged and usd_goal columns, on the order of a few cents or dollars. Only the first entry of each pair is kept.

In [None]:
df.drop_duplicates(subset='id', keep='first', inplace=True)

In [None]:
# Setting the id column as the index
df.set_index('id', inplace=True)
df.head()

### Exploratory Data Analysis
In this section, the data is explored and various plots are generated to understand general patterns within the data. Several important features are visualized in more detail, which can provide insight into what might impact the success of a project the most.

In [None]:
# Summary statistics for the numerical features
df.describe()

In [None]:
# Plotting summary comparisons between successful and failed projects
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6), (ax7, ax8, ax9)) = plt.subplots(3, 3, figsize=(12,12))

df['state'].value_counts(ascending=True).plot(kind='bar', ax=ax1, rot=0)
ax1.set_title('Number of projects')
ax1.set_xlabel('')

df.groupby('state').usd_goal.median().plot(kind='bar', ax=ax2, rot=0)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')

df.groupby('state').usd_pledged.median().plot(kind='bar', ax=ax3, rot=0)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')

df.groupby('state').backers_count.median().plot(kind='bar', ax=ax4, rot=0)
ax4.set_title('Median backers per project')
ax4.set_xlabel('')

df.groupby('state').campaign_days.mean().plot(kind='bar', ax=ax5, rot=0)
ax5.set_title('Mean campaign length (days)')
ax5.set_xlabel('')

df.groupby('state').creation_to_launch_days.mean().plot(kind='bar', ax=ax6, rot=0)
ax6.set_title('Mean creation to launch length (days)')
ax6.set_xlabel('')

df.groupby('state').name_length.mean().plot(kind='bar', ax=ax7, rot=0)
ax7.set_title('Mean name length (words)')
ax7.set_xlabel('')

df.groupby('state').blurb_length.mean().plot(kind='bar', ax=ax8, rot=0)
ax8.set_title('Mean blurb length (words)')
ax8.set_xlabel('')

# Creating a dataframe grouped by staff_pick with columns for failed and successful
pick_df = pd.get_dummies(df.set_index('staff_pick').state).groupby('staff_pick').sum()
# Normalizes counts by column, and selects the 'True' category (iloc[1])
(pick_df.div(pick_df.sum(axis=0), axis=1)).iloc[1].plot(kind='bar', ax=ax9, rot=0) 
ax9.set_title('Proportion that are staff picks')
ax9.set_xlabel('')

fig.subplots_adjust(hspace=0.3)
plt.show()

Based on the above graphs, successful projects tend to have a smaller project goal, shorter campaign length, and longer creation to launch length. A high amount pledged, high number of backers, and staff picks are generally expected for successful projects and can be understood more as correlation than causation. The name length and blurb length are about the same for both successful and failed projects.

In [None]:
# Plotting the number of projects launched each month
plt.figure(figsize=(16,6))
df.set_index('launched_at').category.resample('MS').count().plot()
plt.xlim('2009-01-01', '2019-02-28') # Limiting to whole months
plt.xlabel('Launch date', fontsize=12)
plt.ylabel('Number of projects', fontsize=12)
plt.title('Number of projects launched on Kickstarter, 2009-2019', fontsize=16)
plt.show()

In [None]:
# Plotting the cumulative amount pledged on Kickstarter
plt.figure(figsize=(16,6))
df.set_index('launched_at').sort_index().usd_pledged.cumsum().plot()
plt.xlim('2009-01-01', '2019-02-28') # Limiting to whole months
plt.xlabel('Launch date', fontsize=12)
plt.ylabel('Cumulative amount pledged ($)', fontsize=12)
plt.title('Cumulative pledges on Kickstarter, 2009-2019', fontsize=16)
plt.show()

In [None]:
print("Average amount pledged per project in each year, in $:")
print(round(df.set_index('launched_at').usd_pledged.resample('YS').mean(),2))

In [None]:
# Plotting the distribution of pledged amounts each year
plt.figure(figsize=(16,6))
sns.boxplot(x=df.launched_at.dt.year, y=np.log(df.usd_pledged)) # Keyword arguments are required in recent seaborn versions
plt.xlabel('Year of launch', fontsize=12)
plt.ylabel('Amount pledged (log-transformed $)', fontsize=12) # Log-transforming to make the trend clearer, as the distribution is heavily positively skewed
plt.title('Amount pledged on Kickstarter projects, 2009-2019', fontsize=16)
plt.show()

In [None]:
print("Average fundraising goal per project in each year, in $:")
print(round(df.set_index('launched_at').usd_goal.resample('YS').mean(),2))

In [None]:
# Plotting the distribution of goal amounts each year
plt.figure(figsize=(16,6))
sns.boxplot(x=df.launched_at.dt.year, y=np.log(df.usd_goal)) # Keyword arguments are required in recent seaborn versions
plt.xlabel('Year of launch', fontsize=12)
plt.ylabel('Goal (log-transformed $)', fontsize=12) # Log-transforming to make the trend clearer, as the distribution is heavily positively skewed
plt.title('Fundraising goals of Kickstarter projects, 2009-2019', fontsize=16)
plt.show()

Significantly more projects are launched after 2014 than in prior years, which might imply a better chance of having a successful project if it was launched in these later years.

In [None]:
# Creating a dataframe grouped by year with columns for failed and successful
year_df = df.set_index('launched_at').state
year_df = pd.get_dummies(year_df).resample('YS').sum()

# Plotting the number and proportion of failed and successful projects each year
fig, ax = plt.subplots(1,2, figsize=(12,4))

color = cm.CMRmap(np.linspace(0.1,0.8,df.launch_day.nunique()))

year_df.plot.bar(ax=ax[0], color=color)
ax[0].set_title('Number of failed and successful projects')
ax[0].set_xlabel('')
ax[0].set_xticklabels(list(range(2009,2020)), rotation=45)

year_df.div(year_df.sum(axis=1), axis=0).successful.plot(kind='bar', ax=ax[1], color=color) # Normalizes counts across rows
ax[1].set_title('Proportion of successful projects')
ax[1].set_xlabel('')
ax[1].set_xticklabels(list(range(2009,2020)), rotation=45)

plt.show()

As it turns out, while there are a greater number of successful projects in 2014 and later, the proportion of successful projects is significantly lower than in the years prior. This may be attributed to the large volume of projects and subsequent competition.
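
The count-then-normalize pattern used for these proportion plots (`get_dummies`, a grouped sum, then `div`) can also be written with `pd.crosstab`. The sketch below uses a small made-up dataframe rather than the Kickstarter data, and the column name `launch_year` is illustrative.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'launch_year': [2013, 2013, 2015, 2015, 2015, 2015],
    'state': ['successful', 'failed', 'successful', 'failed', 'failed', 'failed'],
})

# normalize='index' divides each row by its total, giving per-year proportions
props = pd.crosstab(toy['launch_year'], toy['state'], normalize='index')
print(props['successful'])
```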

In [None]:
# Creating a dataframe grouped by category with columns for failed and successful
cat_df = pd.get_dummies(df.set_index('category').state).groupby('category').sum()

# Plotting
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(12,12))

color = cm.CMRmap(np.linspace(0.1,0.8,df.category.nunique())) # Setting a colormap

df.groupby('category').category.count().plot(kind='bar', ax=ax1, color=color)
ax1.set_title('Number of projects')
ax1.set_xlabel('')

df.groupby('category').usd_goal.median().plot(kind='bar', ax=ax2, color=color)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')

df.groupby('category').usd_pledged.median().plot(kind='bar', ax=ax3, color=color)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')

cat_df.div(cat_df.sum(axis=1), axis=0).successful.plot(kind='bar', ax=ax4, color=color) # Normalizes counts across rows
ax4.set_title('Proportion of successful projects')
ax4.set_xlabel('')

df.groupby('category').backers_count.median().plot(kind='bar', ax=ax5, color=color)
ax5.set_title('Median backers per project')
ax5.set_xlabel('')

df.groupby('category').pledge_per_backer.median().plot(kind='bar', ax=ax6, color=color)
ax6.set_title('Median pledged per backer ($)')
ax6.set_xlabel('')

fig.subplots_adjust(hspace=0.6)
plt.show()

Looking at the projects in the most common categories, it seems that projects in the comics and dance categories have the largest proportion of successful projects (and highest amount pledged), but also have smaller project goals. Comics and games projects have the most backers, but dance and film & video projects have the most pledged per backer. Technology projects have the highest project goal, but relatively low amount pledged and low number of backers.

In [None]:
# Creating a dataframe grouped by country with columns for failed and successful
country_df = pd.get_dummies(df.set_index('country').state).groupby('country').sum()

# Plotting
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(16,12))

color = cm.CMRmap(np.linspace(0.1,0.8,df.country.nunique()))

df.groupby('country').country.count().plot(kind='bar', ax=ax1, color=color, rot=0)
ax1.set_title('Number of projects')
ax1.set_xlabel('')

df.groupby('country').usd_goal.median().plot(kind='bar', ax=ax2, color=color, rot=0)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')

df.groupby('country').usd_pledged.median().plot(kind='bar', ax=ax3, color=color, rot=0)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')

country_df.div(country_df.sum(axis=1), axis=0).successful.plot(kind='bar', ax=ax4, color=color, rot=0) # Normalizes counts across rows
ax4.set_title('Proportion of successful projects')
ax4.set_xlabel('')

df.groupby('country').backers_count.median().plot(kind='bar', ax=ax5, color=color, rot=0)
ax5.set_title('Median backers per project')
ax5.set_xlabel('')

df.groupby('country').pledge_per_backer.median().plot(kind='bar', ax=ax6, color=color, rot=0)
ax6.set_title('Median pledged per backer ($)')
ax6.set_xlabel('')

fig.subplots_adjust(hspace=0.3)
plt.show()

Although most projects originate in the United States, Hong Kong has the largest proportion of successful projects, also having the highest median pledged per project and highest number of backers.

In [None]:
# Creating a dataframe grouped by the day on which they were launched, with columns for failed and successful
day_df = pd.get_dummies(df.set_index('launch_day').state).groupby('launch_day').sum()

# Plotting
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(14,12))

color = cm.CMRmap(np.linspace(0.1,0.8,df.launch_day.nunique()))

weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df.groupby('launch_day').launch_day.count().reindex(weekdays).plot(kind='bar', ax=ax1, color=color, rot=0)
ax1.set_title('Number of projects launched')
ax1.set_xlabel('')

df.groupby('launch_day').usd_goal.median().reindex(weekdays).plot(kind='bar', ax=ax2, color=color, rot=0)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')

df.groupby('launch_day').usd_pledged.median().reindex(weekdays).plot(kind='bar', ax=ax3, color=color, rot=0)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')

day_df.div(day_df.sum(axis=1), axis=0).successful.reindex(weekdays).plot(kind='bar', ax=ax4, color=color, rot=0) # Normalizes counts across rows
ax4.set_title('Proportion of successful projects')
ax4.set_xlabel('')

df.groupby('launch_day').backers_count.median().reindex(weekdays).plot(kind='bar', ax=ax5, color=color, rot=0)
ax5.set_title('Median backers per project')
ax5.set_xlabel('')

df.groupby('launch_day').pledge_per_backer.median().reindex(weekdays).plot(kind='bar', ax=ax6, color=color, rot=0)
ax6.set_title('Median pledged per backer ($)')
ax6.set_xlabel('')

fig.subplots_adjust(hspace=0.3)
plt.show()

Most successful projects are launched on a Tuesday with a higher amount pledged and higher number of backers than other days.

In [None]:
# Creating a dataframe grouped by the month in which they were launched, with columns for failed and successful
month_df = pd.get_dummies(df.set_index('launch_month').state).groupby('launch_month').sum()

# Plotting
months = list(calendar.month_name)[1:]

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(14,12))

color = cm.CMRmap(np.linspace(0.1,0.8,df.launch_month.nunique()))

df.groupby('launch_month').launch_month.count().reindex(months).plot(kind='bar', ax=ax1, color=color, rot=45)
ax1.set_title('Number of projects launched')
ax1.set_xlabel('')
ax1.set_xticklabels(labels=ax1.get_xticklabels(), ha='right')

df.groupby('launch_month').usd_goal.median().reindex(months).plot(kind='bar', ax=ax2, color=color, rot=45)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')
ax2.set_xticklabels(labels=ax2.get_xticklabels(), ha='right')

df.groupby('launch_month').usd_pledged.median().reindex(months).plot(kind='bar', ax=ax3, color=color, rot=45)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')
ax3.set_xticklabels(labels=ax3.get_xticklabels(), ha='right')

month_df.div(month_df.sum(axis=1), axis=0).successful.reindex(months).plot(kind='bar', ax=ax4, color=color, rot=45) # Normalizes counts across rows
ax4.set_title('Proportion of successful projects')
ax4.set_xlabel('')
ax4.set_xticklabels(labels=ax4.get_xticklabels(), ha='right')

df.groupby('launch_month').backers_count.median().reindex(months).plot(kind='bar', ax=ax5, color=color, rot=45)
ax5.set_title('Median backers per project')
ax5.set_xlabel('')
ax5.set_xticklabels(labels=ax5.get_xticklabels(), ha='right')

df.groupby('launch_month').pledge_per_backer.median().reindex(months).plot(kind='bar', ax=ax6, color=color, rot=45)
ax6.set_title('Median pledged per backer ($)')
ax6.set_xlabel('')
ax6.set_xticklabels(labels=ax6.get_xticklabels(), ha='right')

fig.subplots_adjust(hspace=0.4)
plt.show()

July is the most popular month to launch a project, but it also has the lowest proportion of successful projects, lowest amount pledged, and lowest number of backers. Otherwise, there are not many clear patterns in the time of year at which projects are launched.

In [None]:
# Creating a dataframe grouped by the time at which they were launched, with columns for failed and successful
time_df = pd.get_dummies(df.set_index('launch_time').state).groupby('launch_time').sum()

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(14,12))

color = cm.CMRmap(np.linspace(0.1,0.8,df.launch_time.nunique()))

times = ['12am-2am', '2am-4am', '4am-6am', '6am-8am', '8am-10am', '10am-12pm', '12pm-2pm', '2pm-4pm', '4pm-6pm', '6pm-8pm', '8pm-10pm', '10pm-12am']

df.groupby('launch_time').launch_time.count().reindex(times).plot(kind='bar', ax=ax1, color=color, rot=45)
ax1.set_title('Number of projects launched')
ax1.set_xlabel('')
ax1.set_xticklabels(labels=ax1.get_xticklabels(), ha='right')

df.groupby('launch_time').usd_goal.median().reindex(times).plot(kind='bar', ax=ax2, color=color, rot=45)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')
ax2.set_xticklabels(labels=ax2.get_xticklabels(), ha='right')

df.groupby('launch_time').usd_pledged.median().reindex(times).plot(kind='bar', ax=ax3, color=color, rot=45)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')
ax3.set_xticklabels(labels=ax3.get_xticklabels(), ha='right')

time_df.div(time_df.sum(axis=1), axis=0).successful.reindex(times).plot(kind='bar', ax=ax4, color=color, rot=45) # Normalizes counts across rows
ax4.set_title('Proportion of successful projects')
ax4.set_xlabel('')
ax4.set_xticklabels(labels=ax4.get_xticklabels(), ha='right')

df.groupby('launch_time').backers_count.median().reindex(times).plot(kind='bar', ax=ax5, color=color, rot=45)
ax5.set_title('Median backers per project')
ax5.set_xlabel('')
ax5.set_xticklabels(labels=ax5.get_xticklabels(), ha='right')

df.groupby('launch_time').pledge_per_backer.median().reindex(times).plot(kind='bar', ax=ax6, color=color, rot=45)
ax6.set_title('Median pledged per backer ($)')
ax6.set_xlabel('')
ax6.set_xticklabels(labels=ax6.get_xticklabels(), ha='right')

fig.subplots_adjust(hspace=0.45)
plt.show()

Looking at project launch times (in UTC/GMT), it seems that between 12pm and 2pm UTC is the best time to launch since it has the highest proportion of successful projects, highest amount pledged, and highest number of backers.

**********************
# Pandas & Sklearn

### Modeling
Finally, we train several different machine learning models to classify projects as successful or unsuccessful. Three different algorithms are used: logistic regression, random forest, and gradient boosting. The performance of these models using Scikit-Learn is compared to the performance of the same models implemented in Spark.

In [None]:
# Dropping columns and creating new dataframe
df_transformed = df.drop(['backers_count', 'created_at', 'deadline', 'is_starrable', 'launched_at', 'usd_pledged', 'sub_category', 'pledge_per_backer'], axis=1)
df_transformed.head()

In [None]:
# Exporting dataframe to csv for Spark (later section)
#df_transformed.to_csv('df.csv')

In [None]:
# Checking for collinearity
# Set the style of the visualization
sns.set(style="white")

# Create a correlation matrix (numeric and boolean columns only)
corr = df_transformed.select_dtypes(include=['number', 'bool']).corr()

# Generate a mask the size of our correlation matrix
mask = np.zeros_like(corr, dtype=bool) # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (11,9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5});

There are slight skews in some of the feature distributions, but these should not affect the machine learning models much. There is also not much collinearity between the given features to be concerned about.

In [None]:
# Dependent variable (success and failure) converted to numerical
df_transformed['state'] = df_transformed['state'].replace({'failed': 0, 'successful': 1})

In [None]:
# Converting boolean features to string to include them in one-hot encoding
df_transformed['staff_pick'] = df_transformed['staff_pick'].astype(str)

In [None]:
# Creating dummy variables
df_transformed = pd.get_dummies(df_transformed)

In [None]:
# Separate into independent and dependent dataframes
X_unscaled = df_transformed.drop('state', axis=1)
y = df_transformed.state

In [None]:
# Transforming the data
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X_unscaled), columns=list(X_unscaled.columns))
X.head()

Before moving forward, the skewness in some of the features is addressed by log-transforming them.
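To illustrate why a log transform helps, here is a minimal sketch on made-up values (not taken from the Kickstarter data). It replaces zeros with 0.01 so that the logarithm is defined, and compares a simple skewness measure before and after. The `skewness` helper and the `goals` array are hypothetical.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype="float64")
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

# Hypothetical goal amounts spanning several orders of magnitude, including a zero
goals = np.array([0, 100, 500, 1000, 2500, 5000, 10000, 50000, 250000, 1000000],
                 dtype="float64")

# Replace 0 with 0.01 so the log is defined, then log-transform
goals_log = np.log(np.where(goals == 0, 0.01, goals))

skew_raw = skewness(goals)
skew_log = skewness(goals_log)
print("Skewness before:", round(skew_raw, 2))
print("Skewness after: ", round(skew_log, 2))
```

Note that the 0.01 replacement turns a zero into a large negative value after the log, so the choice of replacement constant can itself influence the transformed distribution.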

In [None]:
# Assessing skewed distributions
cols_to_log = ['creation_to_launch_days', 'name_length', 'usd_goal']
df_transformed[cols_to_log].hist(figsize=(8,6));

In [None]:
# Replacing 0s with 0.01 and log-transforming
for col in cols_to_log:
    df_transformed[col] = df_transformed[col].astype('float64').replace(0.0, 0.01)
    df_transformed[col] = np.log(df_transformed[col])

In [None]:
# Checking new distributions
df_transformed[cols_to_log].hist(figsize=(8,6));

In [None]:
X_unscaled_log = df_transformed.drop('state', axis=1)
y_log = df_transformed.state

In [None]:
# Transforming the data
scaler = StandardScaler()
X_log = pd.DataFrame(scaler.fit_transform(X_unscaled_log), columns=list(X_unscaled_log.columns))
X_log.head()

In [None]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_log, y_log, test_size=0.3, random_state=123)

#### Logistic Regression

In [None]:
# Fitting a logistic regression model (with default parameters)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

In [None]:
# Making predictions
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

In [None]:
# Logistic regression scores
print("Logistic regression score for training set:", round(logreg.score(X_train, y_train),5))
print("Logistic regression score for test set:", round(logreg.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, y_hat_test))

In [None]:
def plot_cf(y_true, y_pred, class_names=None, model_name=None):
    """Plots a confusion matrix"""
    cf = confusion_matrix(y_true, y_pred)
    plt.imshow(cf, cmap=plt.cm.Blues)
    plt.grid(False)
    if model_name:
        plt.title("Confusion Matrix: {}".format(model_name))
    else:
        plt.title("Confusion Matrix")
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    
    # Fall back to the sorted labels found in y_true when no names are given
    if class_names is None:
        class_names = sorted(set(y_true))
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names)
    plt.yticks(tick_marks, class_names)
    
    thresh = cf.max() / 2.
    
    for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
        plt.text(j, i, cf[i, j], horizontalalignment='center', color='white' if cf[i, j] > thresh else 'black')

    plt.colorbar()

In [None]:
# Confusion matrix
plot_cf(y_test, y_hat_test)

In [None]:
# Plotting the AUC-ROC (logreg was already fitted above, so it is not refitted here)
y_score = logreg.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

print('AUC:', round(auc(fpr, tpr),5))

plt.figure(figsize=(10,8))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

The logistic regression model has a fairly good accuracy of about 71% on the test set, with similar results on the training set. The confusion matrix, however, shows that it is somewhat worse at predicting failures than successes, and recall differs notably between the two classes.

The area under the ROC curve is reasonably high at 0.78, which makes logistic regression a decent baseline classifier. Later algorithms will try to improve on these results.
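One way to read the AUC is as the probability that a randomly chosen successful project receives a higher score than a randomly chosen failed one. The sketch below computes it that way on made-up labels and scores (not the model's actual output). The helper `pairwise_auc` is hypothetical and is only practical for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = successful, 0 = failed) and classifier scores
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]

auc_value = pairwise_auc(labels, scores)
print("AUC:", round(auc_value, 3))
```

Here 7 of the 9 positive/negative pairs are ordered correctly, which gives an AUC of about 0.78.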

### Principal Component Analysis

Since there are a large number of features in the dataset, we will attempt to utilize principal component analysis to reduce the dimensionality of the problem.

In [None]:
pca = PCA()
pca.fit_transform(X)
explained_var = np.cumsum(pca.explained_variance_ratio_)

In [None]:
# Plotting the amount of variation explained by PCA with different numbers of components
plt.plot(list(range(1, len(explained_var)+1)), explained_var)
plt.title('Amount of variation explained by PCA', fontsize=14)
plt.xlabel('Number of components')
plt.ylabel('Explained variance');

In [None]:
# np.where returns zero-based indices, so add 1 to get the number of components
print("Number of components explaining 80% of variance:", np.where(explained_var > 0.8)[0][0] + 1)
print("Number of components explaining 90% of variance:", np.where(explained_var > 0.9)[0][0] + 1)
print("Number of components explaining 99% of variance:", np.where(explained_var > 0.99)[0][0] + 1)

In [None]:
# evaluating logistic regression model using different number of components
n_comps = [58,70,90]
for n in n_comps:
    pipe = Pipeline([('pca', PCA(n_components=n)), ('clf', LogisticRegression())])
    pipe.fit(X_train, y_train)
    print("\nNumber of components:", n)
    print("Score:", round(pipe.score(X_test, y_test),5))

In [None]:
# Feature weightings on each component, in order of average weighting
pca = PCA(n_components=90)
pca.fit_transform(X)
pca_90_components = pd.DataFrame(pca.components_,columns=X.columns).T # Components as columns, features as rows
pca_90_components['mean_weight'] = pca_90_components.iloc[:].abs().mean(axis=1)
pca_90_components.sort_values('mean_weight', ascending=False)

In [None]:
# Plotting feature importances
plt.figure(figsize=(20,5))
color = cm.CMRmap(np.linspace(0.1,0.8,df.launch_day.nunique()))
pca_90_components.mean_weight.sort_values(ascending=False).plot(kind='bar', color=color)
plt.show()

The tables below show the 10 most heavily weighted features in each of the first three components.

- Component 1 = the top two features relate to the country a project is from, primarily the US and the UK (the top two most common countries).
- Component 2 = the top two features relate to whether or not a project was chosen as a staff pick.
- Component 3 = the top two features relate to the time of year the project was launched, specifically in October.

In [None]:
pca_90_components[0].abs().sort_values(ascending=False)[:10]

In [None]:
pca_90_components[1].abs().sort_values(ascending=False)[:10]

In [None]:
pca_90_components[2].abs().sort_values(ascending=False)[:10]

Principal component analysis requires a large number of components (90) to match the prediction accuracy of plain logistic regression.
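When reading component counts off the cumulative explained variance, keep in mind that array positions are zero-based, so the number of components is the position plus one. A minimal sketch with a made-up explained-variance vector follows. The values and the helper `components_for_variance` are hypothetical, and the helper uses "reaches or exceeds" the threshold, which is an assumption.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the zero-based position of the first value >= threshold
    index = int(np.searchsorted(cumulative, threshold, side="left"))
    return index + 1

# Hypothetical explained-variance ratios for five components (they sum to 1)
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

n_80 = components_for_variance(ratios, 0.80)
n_90 = components_for_variance(ratios, 0.90)
print("Components for 80%:", n_80)
print("Components for 90%:", n_90)
```

With these ratios the cumulative sums are 0.5, 0.75, 0.875, 0.9375 and 1.0, so three components are needed for 80% and four for 90%.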

### Logistic Regression & PCA

In [None]:
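# Using GridSearchCV to test multiple different parameters
# The liblinear solver is set explicitly because it supports both the l1 and l2 penalties
pipe_logreg = Pipeline([('pca', PCA(n_components=90)),
                    ('clf', LogisticRegression(solver='liblinear'))])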

params_logreg = [
    {'clf__penalty': ['l1', 'l2'],
     'clf__fit_intercept': [True, False],
        'clf__C': [0.001, 0.01, 1, 10]
    }
]

grid_logreg = GridSearchCV(estimator=pipe_logreg,
                  param_grid=params_logreg,
                  cv=5)

grid_logreg.fit(X_train, y_train)

logreg_best_score = grid_logreg.best_score_
logreg_best_params = grid_logreg.best_params_

print("Best accuracy:", round(logreg_best_score,2))
print("Best parameters:", logreg_best_params)

Results from the logistic regression hyperparameter optimization:

- Best accuracy: 0.71
- Best parameters: {'clf__C': 10, 'clf__fit_intercept': True, 'clf__penalty': 'l2'}

In [None]:
pipe_best_logreg = Pipeline([('pca', PCA(n_components=90)),
                    ('clf', LogisticRegression(C=10, fit_intercept=True, penalty='l2', solver='liblinear'))])

pipe_best_logreg.fit(X_train, y_train)

lr_y_hat_train = pipe_best_logreg.predict(X_train)
lr_y_hat_test = pipe_best_logreg.predict(X_test)

print("Logistic regression score for training set:", round(pipe_best_logreg.score(X_train, y_train),5))
print("Logistic regression score for test set:", round(pipe_best_logreg.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, lr_y_hat_test))
plot_cf(y_test, lr_y_hat_test)

As expected, combining PCA and hyperparameter optimization with logistic regression does not improve on the base logistic regression model, and yields about the same accuracy. Since PCA neither significantly simplifies the training process nor improves the results, it is not used for the following machine learning models, and it is also omitted when training models with Spark.

### Random Forest

In [None]:
# Using GridSearchCV to test multiple different parameters (can be skipped to save time)
rf = RandomForestClassifier(min_samples_split=0.001, verbose=2)

params_rf = [ 
  {'n_estimators': [200, 400],
   'max_depth': [20, 35]
  }
]

grid_rf = GridSearchCV(estimator=rf, param_grid=params_rf, cv=5)

grid_rf.fit(X_train, y_train)

rf_best_score = grid_rf.best_score_
rf_best_params = grid_rf.best_params_

print("Best accuracy:", round(rf_best_score,2))
print("Best parameters:", rf_best_params)

Results:
- Best accuracy: 0.74
- Best parameters: {'max_depth': 35, 'n_estimators': 400}

In [None]:
best_rf = RandomForestClassifier(max_depth=35, min_samples_split=0.001, n_estimators=400)

best_rf.fit(X_train, y_train)

rf_y_hat_train = best_rf.predict(X_train)
rf_y_hat_test = best_rf.predict(X_test)

print("Random Forest score for training set:", round(best_rf.score(X_train, y_train),5))
print("Random Forest score for test set:", round(best_rf.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, rf_y_hat_test))
plot_cf(y_test, rf_y_hat_test)

In [None]:
# Plotting feature importance
n_features = X_train.shape[1]
plt.figure(figsize=(8,20))
plt.barh(range(n_features), best_rf.feature_importances_, align='center') 
plt.yticks(np.arange(n_features), X_train.columns.values) 
plt.title("Feature importances in the best Random Forest model")
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.show()

### Gradient Boosting

In [None]:
# Using GridSearchCV to test multiple different parameters

# Named xgb_clf so that the imported xgb module is not overwritten
xgb_clf = xgb.XGBClassifier(learning_rate=0.1, max_depth=35, verbosity=2)

params_xgb = [ 
  {'n_estimators': [100, 200],
   'subsample': [0.7, 0.9],
   'min_child_weight': [100, 200]
  }
]

grid_xgb = GridSearchCV(estimator=xgb_clf, param_grid=params_xgb, cv=5)

grid_xgb.fit(X_train, y_train)

xgb_best_score = grid_xgb.best_score_
xgb_best_params = grid_xgb.best_params_

print("Best accuracy:", round(xgb_best_score,2))
print("Best parameters:", xgb_best_params)

Results:
- Best accuracy: 0.75
- Best parameters: {'min_child_weight': 100, 'n_estimators': 100, 'subsample': 0.7}

In [None]:
best_xgb = xgb.XGBClassifier(learning_rate=0.1, max_depth=35, min_child_weight=100, n_estimators=100, subsample=0.7)

best_xgb.fit(X_train, y_train)

xgb_y_hat_train = best_xgb.predict(X_train)
xgb_y_hat_test = best_xgb.predict(X_test)

print("XGBoost score for training set:", round(best_xgb.score(X_train, y_train),5))
print("XGBoost score for test set:", round(best_xgb.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, xgb_y_hat_test))
plot_cf(y_test, xgb_y_hat_test)

In [None]:
# Plotting feature importance
n_features = X_train.shape[1]
plt.figure(figsize=(8,20))
plt.barh(range(n_features), best_xgb.feature_importances_, align='center') 
plt.yticks(np.arange(n_features), X_train.columns.values) 
plt.title("Feature importances in the best XGBoost model")
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.show()

Of the three machine learning models above, XGBoost performed best, with an accuracy of 75.1% on the test set, although the other algorithms scored only a few points lower (logistic regression at about 71%, and random forest with a best cross-validated accuracy of 0.74). Finding the best hyperparameters with GridSearchCV can be very time-consuming and improves accuracy by a few percentage points at most.
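The cost of a grid search can be estimated before running it: with 5-fold cross-validation, every parameter combination is fitted five times, and GridSearchCV then refits the best combination once on the full training set (its default behaviour). The sketch below counts the fits for the three grids used in this notebook. The helper `count_fits` is hypothetical and written only for this illustration.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Number of model fits for a grid search: combinations x folds, plus one final refit."""
    n_combinations = len(list(product(*param_grid.values())))
    return n_combinations * cv + 1

# Parameter grids copied from the grid search cells in this notebook
grids = {
    "logistic regression": {
        "penalty": ["l1", "l2"],
        "fit_intercept": [True, False],
        "C": [0.001, 0.01, 1, 10],
    },
    "random forest": {"n_estimators": [200, 400], "max_depth": [20, 35]},
    "xgboost": {
        "n_estimators": [100, 200],
        "subsample": [0.7, 0.9],
        "min_child_weight": [100, 200],
    },
}

fits = {name: count_fits(grid, cv=5) for name, grid in grids.items()}
for name, n in fits.items():
    print(name, "->", n, "fits")
```

Even these small grids require 81, 21 and 41 fits respectively, which is why the search dominates the run time for the tree-based models.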

The feature importance plots show how heavily each algorithm weights each feature when classifying a project as successful or failed. XGBoost weighted staff pick much more heavily than the random forest did. This feature matters because a staff pick is highlighted on the site. The random forest also recognizes this, but additionally emphasizes the USD goal, along with several other features that fit the trends seen in the exploration phase. With PCA, the top components reflected staff picks as well as country and time of launch.

In a later analysis, natural language processing could show how the content of the blurb and name, rather than just their length, affects the success of a project. Blurb length is highlighted by the random forest, so it is not an insignificant feature.

# Spark

### Modeling
In this section, the same machine learning algorithms are run in PySpark and their performance is compared with the results above (with Spark's gradient boosted trees used in place of XGBoost). An auxiliary Python script trains and evaluates the models using PySpark on AWS EMR. The job is run in AWS, and the results (printed to stdout) are downloaded and included in this notebook. Specifically, the results include the area under the ROC curve and a confusion matrix for the test set. Note that, unlike the models above, these models were trained without a grid search for the best hyperparameters.

### AWS CLI

aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-3d6c6966","EmrManagedSlaveSecurityGroup":"sg-0f22d4dbfdcaef582","EmrManagedMasterSecurityGroup":"sg-05cfd661629f9ceb4"}' --release-label emr-5.23.0 --log-uri 's3n://aws-logs-435322005424-us-west-1/elasticmapreduce/' --steps '[{"Args":["spark-submit","--deploy-mode","cluster","s3://rl-cs696/kickstarter.py","-i","s3://rl-cs696/df.csv","-o","s3://rl-cs696/output"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"},{"Args":["spark-submit","--deploy-mode","cluster","s3://rl-cs696/kickstarter.py","-i","s3://rl-cs696/df.csv","-o","s3://rl-cs696/output"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]' --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"}]' --auto-scaling-role EMR_AutoScaling_DefaultRole --bootstrap-actions '[{"Path":"s3://rl-cs696/install.sh","Name":"Custom action"}]' --ebs-root-volume-size 10 --service-role EMR_DefaultRole --enable-debugging --name 'kickstarter_v4' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-west-1

### Logistic Regression
Logistic Regression, Test Area Under ROC: 0.7535184064953994  
Logistic Regression, Confusion Matrix:  
[21876,  6450],  
[ 9297, 12916]

In [None]:
print('Accuracy: ', (21876+12916)/(21876+12916+6450+9297))

### Random Forest
Random Forest, Test Area Under ROC: 0.7884483755844446  
Random Forest, Confusion Matrix:  
[25495,  2831],  
[12058, 10155]

In [None]:
print('Accuracy: ', (25495+10155)/(25495+10155+2831+12058))

### Gradient Boosted Trees
Gradient Boosted Tree, Test Area Under ROC: 0.7956970081050084  
Gradient Boosted Tree, Confusion Matrix:  
[23599,  4727],  
[ 9034, 13179]

In [None]:
print('Accuracy: ', (23599+13179)/(23599+13179+4727+9034))

Spark's machine learning algorithms perform similarly to sklearn's, with accuracies between roughly 69% and 73%. The best model in Spark is gradient boosted trees, the same family of algorithm that XGBoost implements, with an accuracy of 72.8%. All of the Spark models use default hyperparameter values, so their scores could plausibly be improved slightly with a grid search like the one used above.
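The three accuracy calculations above can be gathered into one helper that also reports recall per class. This is a minimal sketch that assumes each confusion matrix has true labels in rows and predicted labels in columns, with failed (0) first and successful (1) second. That orientation is an assumption about the Spark output, and the helper `summarize` is hypothetical. Accuracy does not depend on the orientation, but the recall figures do.

```python
def summarize(cm):
    """Accuracy and per-class recall from a 2x2 confusion matrix.

    Assumes rows are true labels and columns are predicted labels,
    with class 0 (failed) first and class 1 (successful) second.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_failed": tn / (tn + fp),
        "recall_successful": tp / (tp + fn),
    }

# Confusion matrices as reported in the Spark output above
spark_results = {
    "Logistic Regression": [[21876, 6450], [9297, 12916]],
    "Random Forest": [[25495, 2831], [12058, 10155]],
    "Gradient Boosted Trees": [[23599, 4727], [9034, 13179]],
}

summaries = {name: summarize(cm) for name, cm in spark_results.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under the stated orientation, all three Spark models recover failed projects more reliably than successful ones, most markedly the random forest.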

### Conclusion

In this project, logistic regression, random forest, and gradient boosting algorithms were used to classify the success of a given project based on features scraped from the Kickstarter website. The models performed similarly, at around 70% accuracy, with the gradient boosting algorithms slightly ahead in both Sklearn (about 75%) and Spark (about 73%).

Looking at both trends in the data from our exploration as well as the important features determined by some of the machine learning algorithms, we were able to gain some insight into which features can help a project succeed.

Some of the factors that had a **positive effect** on success are:

**Most important:**
- Smaller project goals
- Being chosen as a staff pick
- Shorter campaigns
- Taking longer between creation and launch
- Comics, dance and games projects

**Less important:**
- Projects from Hong Kong
- Film & video and music projects are popular categories on the site, and are fairly successful
- Launching on a Tuesday
- Launching in October
- Launching between 12pm and 2pm UTC
- Name and blurb lengths (shorter blurbs and longer names are preferred)

Factors which had a **negative effect** on success are:

**Most important:**
- Large goals
- Longer campaigns
- Food and journalism projects
- Projects from Italy

**Less important:**
- Launching on a weekend
- Launching in July or December
- Launching between 6pm and 4am UTC

Overall, Kickstarter seems to favor smaller, high-quality projects, particularly comics, dance and games projects, which capture many of the above attributes. It is less suited to larger projects, particularly food and journalism projects. Anyone launching a project on this platform should keep these factors in mind in order to maximize the chance of success, while also understanding the limitations of machine learning in this context. The data was limited to the features given on the website, so there are likely many more factors at play. With more data, more insight could be derived. Ultimately, combining business and marketing acumen with data will maximize the success of such crowdfunded projects.

### Pandas/Sklearn Vs Spark

The Pandas and Spark implementations performed similarly in terms of accuracy for each machine learning model, as well as in run time and ease of implementation. The dataset, however, was not especially large, with only 209222 rows. For a much larger dataset, the Spark implementation can be expected to run more efficiently than Pandas.

Cleaning the data with Pandas is much easier and more interactive, especially when using Jupyter notebooks to visualize the data. It is possible to clean the data with Spark as well, but because Spark jobs are usually batch jobs, the code is harder to debug and the data harder to visualize. This is why the data was cleaned and explored with Pandas before Spark was used in this project. As a future step, it would be interesting to consider Python's parallel dataframe libraries, such as Dask and Vaex, to clean and visualize large datasets. These libraries offer Pandas-like dataframe APIs (Dask dataframes are built on top of Pandas) and can be used within Jupyter, which would make the process easier and more versatile than sending jobs to AWS EMR. If these libraries can process data as fast as Spark can, they may become the preferred way to process and analyze big data.