# Kickstarter Projects Analyzer

This code, created by [Xavi Burgos](https://xburgos.es/), analyzes Kickstarter projects from a [Kaggle dataset](https://www.kaggle.com/kemical/kickstarter-projects) using machine learning techniques learned in the Computer Learning course at the [Universitat Autònoma de Barcelona](https://www.uab.cat/).

## 1. Importing libraries

Import all the libraries needed for the project.

In [1]:
import pandas as pd

import src.eda as eda
import src.pp as pp

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 2. Exploratory Data Analysis (EDA)

Explore the data using different techniques to get a better understanding of the data.

Set up the datasets:

In [2]:
# Set up the datasets
datasets = { 
	2017: {
		'path': 'data/csv/ks-projects-201612.csv',
		'encoding': 'cp1252'
	},
 	2018: {
		'path': 'data/csv/ks-projects-201801.csv',
		'encoding': 'ISO-8859-1'
	}
}

Select the first dataset to analyze. It is the December 2016 snapshot (`ks-projects-201612.csv`), stored under the key `2017`:

In [3]:
dataset = datasets[2017]

Read the CSV file and check the first and last rows:

In [4]:
# Get the dataframe
df = eda.get_dataframe(dataset['path'], encoding = dataset['encoding'])

# Display the first 2 rows of the dataframe
display(df.head(2))

# Display the last 2 rows of the dataframe
display(df.tail(2))



Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
323748,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13 02:00:00,15000,2016-01-13 18:13:53,200,failed,6,US,200,,,,
323749,999988282,Nou Renmen Ayiti! We Love Haiti!,Performance Art,Art,USD,2011-08-16 09:07:47,2000,2011-07-19 09:07:47,524,failed,17,US,524,,,,


The last 4 columns are unnamed and have NaN values, so let's see what they are:

In [5]:
# Get a row where the first unnamed column is not NaN
row1 = eda.filter_non_nan_column(df, 'Unnamed: 13').iloc[0]

# Get a row where the second unnamed column is not NaN
row2 = eda.filter_non_nan_column(df, 'Unnamed: 14').iloc[0]

# Get a row where the third unnamed column is not NaN
row3 = eda.filter_non_nan_column(df, 'Unnamed: 15').iloc[0]

# Get a row where the fourth unnamed column is not NaN
row4 = eda.filter_non_nan_column(df, 'Unnamed: 16').iloc[0]

# Unify the rows and display the result
res = pd.concat([row1, row2, row3, row4], axis = 1).T
display(res)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
1454,1008705746,Zephyra´s new full length,'As The World Collapses',Metal,Music,SEK,2016-02-02 00:56:46,15000,2016-01-03 00:56:46,4262,failed,14,SE,504.94765278,,,
13795,1081139420,The Rolling Stones,BEGGARS BANQUET,50 Years in the Making,Rock,Music,USD,2011-08-12 01:17:48,4625,2011-06-04 01:17:48,20,failed,2,US,20.0,,
104120,1618382802,Druid Hill Park Passport: Discover,Enjoy,Learn,Be active!,Publishing,Publishing,USD,2012-07-28 01:30:00,9500,2012-06-06 23:54:14,9854,successful,208,US,9854.0,
269970,677103185,SixSixSeven,Angels,Demons,Religion,Esoteric,Graphic Novels,Comics,USD,2015-10-10 01:00:00,750,2015-09-10 18:15:45,25,failed,1,US,25.0
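The rows above look shifted to the right, which is what happens when a text field contains commas that are not quoted. The following is a minimal sketch of that effect using Python's standard `csv` module. The two-line CSV strings are hypothetical and only loosely based on the last row shown above; they are not taken from the dataset file.

```python
import csv
import io

# Hypothetical CSV: the name contains commas but is not quoted
raw = "ID,name,category\n1,SixSixSeven,Angels,Demons,Comics\n"
header, broken = list(csv.reader(io.StringIO(raw)))

# The header has 3 fields, but the data row is split into 5
print(len(header), len(broken))

# The same record with the name quoted parses into the expected 3 fields
quoted = 'ID,name,category\n1,"SixSixSeven,Angels,Demons",Comics\n'
fixed = list(csv.reader(io.StringIO(quoted)))[1]
print(fixed)
```

Every extra comma pushes the remaining values one column to the right, which would explain why the unnamed columns at the end are only filled for some rows.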


Some string columns contain commas, so the CSV file is not well formatted. We will fix this later; for now, let's analyze the other dataset:

In [6]:
dataset = datasets[2018]

# Get the dataframe
df = eda.get_dataframe(dataset['path'], encoding = dataset['encoding'])

# Display the first 2 rows of the dataframe
display(df.head(2))

# Display the last 2 rows of the dataframe
display(df.tail(2))

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
378659,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.0,2016-01-13 18:13:53,200.0,failed,6,US,200.0,200.0,15000.0
378660,999988282,Nou Renmen Ayiti! We Love Haiti!,Performance Art,Art,USD,2011-08-16,2000.0,2011-07-19 09:07:47,524.0,failed,17,US,524.0,524.0,2000.0


The second dataset, from 2018, has the proper format, so we will use this one until we fix the first one.

First, let's get the information about the dataset:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ID                378661 non-null  int64  
 1   name              378657 non-null  object 
 2   category          378661 non-null  object 
 3   main_category     378661 non-null  object 
 4   currency          378661 non-null  object 
 5   deadline          378661 non-null  object 
 6   goal              378661 non-null  float64
 7   launched          378661 non-null  object 
 8   pledged           378661 non-null  float64
 9   state             378661 non-null  object 
 10  backers           378661 non-null  int64  
 11  country           378661 non-null  object 
 12  usd pledged       374864 non-null  float64
 13  usd_pledged_real  378661 non-null  float64
 14  usd_goal_real     378661 non-null  float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB


We can see that the "name" and "usd pledged" columns have 378657 and 374864 non-null values respectively, instead of 378661, so let's count the NaN values in each column:

In [8]:
res = df.isnull().sum()
display(res)

# Get only the "name" and "usd pledged" columns
df_res = df[['name', 'usd pledged']]
res = (df_res.isnull().sum() / len(df_res)) * 100
res = res.apply(lambda x: '{:.4f}%'.format(x))
display(res)

ID                     0
name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

name           0.0011%
usd pledged    1.0027%
dtype: object

Only a small fraction of the values are NaN: 0.0011% of "name" and 1.0027% of "usd pledged". We could try to fill them in, but dropping those rows loses very little information. We will deal with this problem in the preprocessing section.

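Now, let's count the unique values in each column to look for duplicated projects. The "ID" column has 378661 unique values, one per row, so no project ID is repeated: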

In [9]:
df.nunique()

ID                  378661
name                375764
category               159
main_category           15
currency                14
deadline              3164
goal                  8353
launched            378089
pledged              62130
state                    6
backers               3963
country                 23
usd pledged          95455
usd_pledged_real    106065
usd_goal_real        50339
dtype: int64
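Counting unique values only hints at duplicates. A more direct check could use `DataFrame.duplicated`, as in the sketch below. The `toy` dataframe and its values are made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy dataframe standing in for df (illustrative values only)
toy = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'name': ['a', 'b', 'b', 'c'],
})

# Rows that repeat an earlier row in every column
n_full_duplicates = toy.duplicated().sum()

# Rows that repeat an earlier value of the "ID" column
n_repeated_ids = toy.duplicated(subset=['ID']).sum()

# True only when every ID appears exactly once
ids_are_unique = toy['ID'].is_unique

print(n_full_duplicates, n_repeated_ids, ids_are_unique)
```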

Describe the dataset to get a better understanding of the data:

In [10]:
res = df.describe(include = 'number').T
display(res)

res = df.describe(exclude = 'number').T
display(res)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,378661.0,1074731000.0,619086200.0,5971.0,538263500.0,1075276000.0,1610149000.0,2147476000.0
goal,378661.0,49080.79,1183391.0,0.01,2000.0,5200.0,16000.0,100000000.0
pledged,378661.0,9682.979,95636.01,0.0,30.0,620.0,4076.0,20338990.0
backers,378661.0,105.6175,907.185,0.0,2.0,12.0,56.0,219382.0
usd pledged,374864.0,7036.729,78639.75,0.0,16.98,394.72,3034.09,20338990.0
usd_pledged_real,378661.0,9058.924,90973.34,0.0,31.0,624.33,4050.0,20338990.0
usd_goal_real,378661.0,45454.4,1152950.0,0.01,2000.0,5500.0,15500.0,166361400.0


Unnamed: 0,count,unique,top,freq
name,378657,375764,New EP/Music Development,41
category,378661,159,Product Design,22314
main_category,378661,15,Film & Video,63585
currency,378661,14,USD,295365
deadline,378661,3164,2014-08-08,705
launched,378661,378089,1970-01-01 01:00:00,7
state,378661,6,failed,197719
country,378661,23,US,292627


Now, we will list the categories to check whether any of them are badly formatted:

In [11]:
res = df['category'].value_counts()
res = res.index.values
res = sorted(res)
res = ', '.join(res)
display(res)


"3D Printing, Academic, Accessories, Action, Animals, Animation, Anthologies, Apparel, Apps, Architecture, Art, Art Books, Audio, Bacon, Blues, Calendars, Camera Equipment, Candles, Ceramics, Children's Books, Childrenswear, Chiptune, Civic Design, Classical Music, Comedy, Comic Books, Comics, Community Gardens, Conceptual Art, Cookbooks, Country & Folk, Couture, Crafts, Crochet, DIY, DIY Electronics, Dance, Design, Digital Art, Documentary, Drama, Drinks, Electronic Music, Embroidery, Events, Experimental, Fabrication Tools, Faith, Family, Fantasy, Farmer's Markets, Farms, Fashion, Festivals, Fiction, Film & Video, Fine Art, Flight, Food, Food Trucks, Footwear, Gadgets, Games, Gaming Hardware, Glass, Graphic Design, Graphic Novels, Hardware, Hip-Hop, Horror, Illustration, Immersive, Indie Rock, Installations, Interactive Design, Jazz, Jewelry, Journalism, Kids, Knitting, Latin, Letterpress, Literary Journals, Literary Spaces, Live Games, Makerspaces, Metal, Mixed Media, Mobile Games, 

## 3. Preprocessing

In this section we will process the data to get it ready for the machine learning algorithms. We will drop the columns that we will not use, fix the NaN values and convert some columns to the proper type. We will also create new columns to get more information from the data.

We will copy the dataset to a new one to avoid modifying the original one:

In [12]:
new_df = df.copy()

First, let's drop the rows with NaN values. We will also drop the "ID" column, because it is not useful for the analysis. Next, we will check the number of rows deleted.

In [13]:
# Remove rows with NaN values in the "name" and "usd pledged" columns
new_df = new_df.dropna(subset = ['name'])
new_df = new_df.dropna(subset = ['usd pledged'])

# Remove the "ID" column
new_df = new_df.drop(['ID'], axis = 1)

# Get the percentage of NaN values in the dataframe
res = (new_df.isnull().sum() / len(new_df)) * 100
res = res.apply(lambda x: '{:.2f}%'.format(x))
res = pd.DataFrame(res, columns = ['NaN values (%)']).T
display(res)

res = len(df) - len(new_df)
print('Number of rows removed: {}'.format(res))

res = res / len(df) * 100
print('Percentage of rows removed: {:.2f}%'.format(res))


Unnamed: 0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
NaN values (%),0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%


Number of rows removed: 3801
Percentage of rows removed: 1.00%


Next, we will check whether any project has a goal of 0. Kickstarter does not allow this, so such rows would be invalid and we would drop them.

In [14]:
# Get number of rows where usd_goal_real is 0
res = len(new_df[new_df['usd_goal_real'] == 0])
print('Number of rows where usd_goal_real is 0: {}'.format(res))

Number of rows where usd_goal_real is 0: 0


The next step is simple but very useful: we will convert the "currency" column to integers so that the machine learning algorithms can use it later.

We will drop the "goal" and "pledged" columns, because they are expressed in each project's own currency, and use the "usd_goal_real" and "usd_pledged_real" columns, which are in USD, instead. We will also drop the "usd pledged" column, a conversion made by Kickstarter, because it is not accurate. The "usd_pledged_real" column that replaces it was converted with the Fixer.io API (a currency converter API).

In [15]:
# Get current currencies
currencies = new_df['currency'].value_counts()
currencies = currencies.index.values
currencies = sorted(currencies)

# Print the currencies
res = ', '.join(currencies)
display(res)

'AUD, CAD, CHF, DKK, EUR, GBP, HKD, JPY, MXN, NOK, NZD, SEK, SGD, USD'

In [16]:
# Convert the "currency" column to an integer column using the "currencies" list
new_df['currency'] = new_df.apply(lambda x: currencies.index(x['currency']), axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,The Songs of Adelaide & Abullah,Poetry,Publishing,5,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,13,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
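The conversion above replaces each currency code with its position in the sorted list of codes. Calling `apply` row by row with `list.index` works, but it can be slow on a few hundred thousand rows. The sketch below shows one possible alternative that builds the lookup once and uses `Series.map`. The `toy` dataframe is a made-up stand-in for `new_df`, and `lookup` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data standing in for new_df['currency'] (illustrative values only)
toy = pd.DataFrame({'currency': ['GBP', 'USD', 'EUR', 'USD']})

# Sorted list of distinct values, as in the notebook
currencies = sorted(toy['currency'].unique())

# Build a value -> position lookup once, then map the whole column
lookup = {code: i for i, code in enumerate(currencies)}
encoded = toy['currency'].map(lookup)

# Row-wise version used in the notebook, kept here for comparison
row_wise = toy.apply(lambda x: currencies.index(x['currency']), axis=1)

print(encoded.tolist())
print(row_wise.tolist())
```

The same pattern would apply to the "category", "main_category" and "country" columns. Note that these integers impose an arbitrary alphabetical order on the values, which tree-based models tolerate better than linear ones.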


In [17]:
# Remove the "goal", "pledged" and "usd pledged" columns
new_df = new_df.drop(['goal', 'pledged', 'usd pledged'], axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,The Songs of Adelaide & Abullah,Poetry,Publishing,5,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,13,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0


We will do the same with the "category" and "main_category" columns, because they are categorical variables that the machine learning algorithms will use later.

In [18]:
# Get current categories
categories = new_df['category'].value_counts()
categories = categories.index.values
categories = sorted(categories)

# Print the categories
res = ', '.join(categories)
display(res)

"3D Printing, Academic, Accessories, Action, Animals, Animation, Anthologies, Apparel, Apps, Architecture, Art, Art Books, Audio, Bacon, Blues, Calendars, Camera Equipment, Candles, Ceramics, Children's Books, Childrenswear, Chiptune, Civic Design, Classical Music, Comedy, Comic Books, Comics, Community Gardens, Conceptual Art, Cookbooks, Country & Folk, Couture, Crafts, Crochet, DIY, DIY Electronics, Dance, Design, Digital Art, Documentary, Drama, Drinks, Electronic Music, Embroidery, Events, Experimental, Fabrication Tools, Faith, Family, Fantasy, Farmer's Markets, Farms, Fashion, Festivals, Fiction, Film & Video, Fine Art, Flight, Food, Food Trucks, Footwear, Gadgets, Games, Gaming Hardware, Glass, Graphic Design, Graphic Novels, Hardware, Hip-Hop, Horror, Illustration, Immersive, Indie Rock, Installations, Interactive Design, Jazz, Jewelry, Journalism, Kids, Knitting, Latin, Letterpress, Literary Journals, Literary Spaces, Live Games, Makerspaces, Metal, Mixed Media, Mobile Games, 

In [19]:
# Get current main categories
main_categories = new_df['main_category'].value_counts()
main_categories = main_categories.index.values
main_categories = sorted(main_categories)

# Print the main categories
res = ', '.join(main_categories)
display(res)

'Art, Comics, Crafts, Dance, Design, Fashion, Film & Video, Food, Games, Journalism, Music, Photography, Publishing, Technology, Theater'

In [20]:
# Convert the "category" column to an integer column using the "categories" list
new_df['category'] = new_df.apply(lambda x: categories.index(x['category']), axis = 1)

# Convert the "main_category" column to an integer column using the "main_categories" list
new_df['main_category'] = new_df.apply(lambda x: main_categories.index(x['main_category']), axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,The Songs of Adelaide & Abullah,108,12,5,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0


As in the previous steps, we will convert the "country" column to integers so that the machine learning algorithms can use it later.

In [21]:
# Get current countries
countries = new_df['country'].value_counts()
countries = countries.index.values
countries = sorted(countries)

# Print the countries
res = ', '.join(countries)
display(res)

'AT, AU, BE, CA, CH, DE, DK, ES, FR, GB, HK, IE, IT, JP, LU, MX, NL, NO, NZ, SE, SG, US'

In [22]:
# Convert the "country" column to an integer column using the "countries" list
new_df['country'] = new_df.apply(lambda x: countries.index(x['country']), axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,The Songs of Adelaide & Abullah,108,12,5,2015-10-09,2015-08-11 12:12:28,failed,0,9,0.0,1533.95
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,2017-11-01,2017-09-02 04:43:57,failed,15,21,2421.0,30000.0


Now, we will get the days between the launch and deadline dates:

In [23]:
# Get the "days" column from the "deadline" and "launched" columns
new_df['days'] = (pd.to_datetime(new_df['deadline']) - pd.to_datetime(new_df['launched'])).dt.days

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,days
0,The Songs of Adelaide & Abullah,108,12,5,2015-10-09,2015-08-11 12:12:28,failed,0,9,0.0,1533.95,58
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,2017-11-01,2017-09-02 04:43:57,failed,15,21,2421.0,30000.0,59


Now, we will define a set of day ranges and replace each value in the "days" column with the index of the range it falls into.

In [24]:
days_ranges = [0, 1, 2, 3, 5, 7, 10, 14, 21, 30, 45] # Minimum of 1 day and maximum of 60 days (Kickstarter limit)

In [25]:
# Define the function to get the days range
def get_days_range(days):
	for i in range(len(days_ranges)):
		if days <= days_ranges[i]:
			return i
	return len(days_ranges)

# Convert the "days" column to an integer column using the "days_ranges" list
new_df['days'] = new_df.apply(lambda x: get_days_range(x['days']), axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,days
0,The Songs of Adelaide & Abullah,108,12,5,2015-10-09,2015-08-11 12:12:28,failed,0,9,0.0,1533.95,11
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,2017-11-01,2017-09-02 04:43:57,failed,15,21,2421.0,30000.0,11
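The range lookup above returns the index of the first boundary that is greater than or equal to the value, or the length of the list when the value exceeds every boundary. As a sketch, the same result can be obtained with `bisect_left` from the standard library, provided the boundaries are sorted in ascending order. The names `get_range_index` and `get_range_index_loop` are introduced here for illustration and are not part of the notebook.

```python
from bisect import bisect_left

days_ranges = [0, 1, 2, 3, 5, 7, 10, 14, 21, 30, 45]

def get_range_index(value, boundaries):
    """Index of the first boundary >= value, or len(boundaries) if none."""
    return bisect_left(boundaries, value)

def get_range_index_loop(value, boundaries):
    # Loop version, mirroring the notebook's get_days_range
    for i in range(len(boundaries)):
        if value <= boundaries[i]:
            return i
    return len(boundaries)

# Both versions agree, including for values beyond the last boundary
for days in (0, 4, 30, 58):
    print(days, get_range_index(days, days_ranges), get_range_index_loop(days, days_ranges))
```

A project that runs for 58 days is beyond the last boundary (45), so it gets index 11, which matches the value shown for the first row above.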


We will extract the launch month from the "launched" column and drop the "deadline" column, because the number of days between the two dates, which we already have, is more useful:

In [26]:
# Get the launched month column from the "launched" column
new_df['launched'] = pd.to_datetime(new_df['launched']).dt.month

# Remove the "deadline" column ("launched" now holds the launch month)
new_df = new_df.drop(columns = ['deadline'])

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,launched,state,backers,country,usd_pledged_real,usd_goal_real,days
0,The Songs of Adelaide & Abullah,108,12,5,8,failed,0,9,0.0,1533.95,11
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,9,failed,15,21,2421.0,30000.0,11


Now, we will derive a "cancelled" column from the "state" column and then drop "state", because we will not use it anymore:

In [27]:
# Get the "cancelled" column from the "state" column
new_df['cancelled'] = new_df.apply(lambda x: 1 if x['state'] == 'canceled' else 0, axis = 1)

# Remove the "state" column
new_df = new_df.drop(columns = ['state'])

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,launched,backers,country,usd_pledged_real,usd_goal_real,days,cancelled
0,The Songs of Adelaide & Abullah,108,12,5,8,0,9,0.0,1533.95,11,0
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,9,15,21,2421.0,30000.0,11,0


Now, we will define some price ranges and use them to replace the "usd_pledged_real" and "usd_goal_real" columns with range indices, which are more useful here than the raw amounts. We will also rename the columns to "pledged" and "goal" to simplify the names.

In [28]:
# Price ranges
price_ranges = [0, 100, 500, 1000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000, 5000000, 10000000]

In [29]:
# Define the function to get the price range
def get_price_range(x):
	for i in range(len(price_ranges)):
		if x <= price_ranges[i]:
			return i
	return len(price_ranges)

# Get the "pledged" column from the "usd_pledged_real" column
new_df['pledged'] = new_df.apply(lambda x: get_price_range(x['usd_pledged_real']), axis = 1)

# Get the "goal" column from the "usd_goal_real" column
new_df['goal'] = new_df.apply(lambda x: get_price_range(x['usd_goal_real']), axis = 1)

# Remove the "usd_pledged_real" and "usd_goal_real" columns
new_df = new_df.drop(columns = ['usd_pledged_real', 'usd_goal_real'])

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,launched,backers,country,days,cancelled,pledged,goal
0,The Songs of Adelaide & Abullah,108,12,5,8,0,9,11,0,0,4
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,9,15,21,11,0,4,7


We will do the same with the "backers" column, to get more information from the data.

In [30]:
# Backers ranges
backers_ranges = [0, 1, 2, 3, 5, 7, 10, 14, 21, 30, 45, 60, 90, 120, 150, 200, 300, 500, 1000, 5000, 10000]

In [31]:
# Define the function to get the backers range
def get_backers_range(x):
	for i in range(len(backers_ranges)):
		if x <= backers_ranges[i]:
			return i
	return len(backers_ranges)

# Replace the "backers" column with the index of its range
new_df['backers'] = new_df.apply(lambda x: get_backers_range(x['backers']), axis = 1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

Unnamed: 0,name,category,main_category,currency,launched,backers,country,days,cancelled,pledged,goal
0,The Songs of Adelaide & Abullah,108,12,5,8,0,9,11,0,0,4
1,Greeting From Earth: ZGAC Arts Capsule For ET,93,6,13,9,8,21,11,0,4,7


Now, we will get keywords from the "name" column. This can be useful to get more information from the data. We will use the NLTK library to get the keywords.

In [32]:
# Define the function to get the keywords
stemmer = pp.get_stemmer()
stop_words = pp.get_stopwords()
def get_keywords(text):
  text = pp.get_keywords(text, stemmer, stop_words)
  return text

# Apply the function to the "name" column to get the "keywords" column
new_df['keywords'] = new_df['name'].apply(get_keywords)

# Remove the "name" column
new_df = new_df.drop(columns = ['name'])

# Create the bag of words
vectorizer = CountVectorizer(max_features = 1000)
bag_of_words = vectorizer.fit_transform(new_df['keywords'])

# Convert the bag of words into a DataFrame
bow_df = pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names_out())

# Remove the "keywords" column
new_df = new_df.drop(columns = ['keywords'])

# Concatenate the two dataframes
final_df = pd.concat([new_df.reset_index(drop=True), bow_df.reset_index(drop=True)], axis=1)

# Display the first 2 rows of the dataframe
display(new_df.head(2))

# Display the first 2 rows of the final dataframe
display(final_df.head(2))



Unnamed: 0,category,main_category,currency,launched,backers,country,days,cancelled,pledged,goal
0,108,12,5,8,0,9,11,0,0,4
1,93,6,13,9,8,21,11,0,4,7


Unnamed: 0,category,main_category,currency,launched,backers,country,days,cancelled,pledged,goal,...,writ,year,yog,york,you,young,youtub,zero,zin,zomby
0,108,12,5,8,0,9,11,0,0,4,...,0,0,0,0,0,0,0,0,0,0
1,93,6,13,9,8,21,11,0,4,7,...,0,0,0,0,0,0,0,0,0,0


## 4. Metric selection

In this section we will select the metrics used to evaluate the model, which will be trained on the features created in the previous section.

We will combine several metrics, because no single one gives a complete picture. The main ones are "Accuracy" and "F1 Score":

- Accuracy: gives a general view of the model performance, but it can be misleading when the classes are imbalanced.
- F1 Score: the harmonic mean of precision and recall. It is a better metric than accuracy when the classes are imbalanced.

The "Confusion Matrix" is a good way to see the performance of the machine learning algorithms. It shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data.

The "Classification Report" is another good way to see the performance of the machine learning algorithms. It shows the precision, recall, f1-score and support for each class.
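The effect of class imbalance on these metrics can be illustrated with a small sketch. The labels below are made up: nine samples of class 0, one of class 1, and a model that always predicts class 0.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 9 samples of class 0 and 1 sample of class 1
y_true = [0] * 9 + [1]

# A model that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores with equal weight
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

# Weighted F1 weights each class by its number of true samples
weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(acc, macro_f1, weighted_f1)
```

Accuracy looks high because the majority class dominates, while the macro F1 score drops sharply since the minority class is never predicted. The weighted average sits in between, which is worth keeping in mind when reading a weighted F1 score.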

## 5. Random Forest

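We will use the Random Forest algorithm to predict how much funding a project receives, as an indicator of whether it will be successful. The target is the "pledged" column, which now holds the index of the pledged amount range, so this is a multi-class classification problem. The features are all the remaining columns: "goal", "backers", "country", "category", "main_category", "currency", "days", "launched" (the launch month), "cancelled" and the keyword columns.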

First, we will split the dataset into train and test datasets:

In [33]:
# Split the dataframe into train and test sets
target = 'pledged'
X_train, X_test, y_train, y_test = train_test_split(
  final_df.drop(target, axis=1),
  final_df[target],
  test_size=0.2,
  random_state=42
)

Now, we will train the model:

In [34]:
# Initialize the random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

With the model trained, we will now predict the labels of the test dataset:

In [35]:
# Make predictions on the test set
y_pred = rf.predict(X_test)

Once we have predicted the test dataset, we will evaluate the model:

In [43]:
# Print the accuracy of the model
res = accuracy_score(y_test, y_pred)
print(res)

# Print the F1 score of the model
res = f1_score(y_test, y_pred, average='weighted')
print(res)

# Set the "accuracy" metric for classification report
metrics = ['accuracy']

# Print the classification report
res = classification_report(y_test, y_pred, zero_division=0, output_dict=True)
res = pd.DataFrame(res).transpose()
display(res)

# Print the confusion matrix
res = confusion_matrix(y_test, y_pred)
res = pd.DataFrame(res)
display(res) 

0.6753721389318679
0.6584300058548553


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,10281,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,12614,1667,34,41,1,0,1,0,0,0,0,0,0,0
2,0,2958,6241,726,981,14,0,0,0,0,0,0,0,0,0
3,0,298,1864,1229,2484,33,10,0,0,0,0,0,0,0,0
4,0,185,1076,775,13441,932,279,55,4,1,0,0,0,0,0
5,0,18,68,39,2816,2419,854,191,14,2,1,0,0,0,0
6,0,14,24,12,812,1149,2227,591,24,4,1,0,0,0,0
7,0,2,5,2,323,177,824,1798,163,27,8,2,0,0,0
8,0,0,1,0,50,45,113,599,231,63,14,4,1,0,0
9,0,0,2,1,19,7,36,216,157,100,42,7,2,0,0


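Given the type of problem that we are trying to solve, a multi-class prediction over 15 pledged ranges, an accuracy of about 0.68 and a weighted F1 score of about 0.66 are not bad. The confusion matrix also shows that most errors fall in ranges close to the correct one. To confirm that the model is not overfitting, these values should be compared with the same metrics computed on the training set.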
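Comparing train and test metrics can be sketched as follows. This example uses synthetic data from `make_classification` rather than the Kickstarter features, so the numbers it prints say nothing about the model above. In the notebook, the same loop would be run with the existing `rf`, `X_train`, `X_test`, `y_train` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for final_df (not the Kickstarter data)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)

# Compute the same metrics on both splits
scores = {}
for split, X_part, y_part in (('train', X_train, y_train), ('test', X_test, y_test)):
    y_hat = rf.predict(X_part)
    scores[split] = {
        'accuracy': accuracy_score(y_part, y_hat),
        'f1_weighted': f1_score(y_part, y_hat, average='weighted'),
    }

# A large positive gap suggests the model memorizes the training data
gap = scores['train']['accuracy'] - scores['test']['accuracy']
print(scores)
print(gap)
```

A random forest with fully grown trees often scores close to 1.0 on its own training data, so some gap is expected. Limiting `max_depth` or raising `min_samples_leaf` are common ways to reduce it.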