# ML Project Group Ivo, Salvo, Kevin

Kickstarter Project

## About Dataset
Kickstarter is a popular crowdfunding platform that has helped thousands of entrepreneurs and creators bring their innovative ideas to life. However, not all Kickstarter projects are successful, and understanding the factors that contribute to success or failure can be valuable for both creators and investors alike.

In this dataset, we have collected information on a large number of Kickstarter projects and whether they ultimately succeeded or failed to meet their funding goals. This dataset includes a wide range of project types, including technology startups, creative arts endeavors, and social impact initiatives, among others.

By analyzing this dataset, researchers and analysts can gain insights into the characteristics of successful and unsuccessful Kickstarter projects, such as funding targets, project categories, and funding sources. This information can be used to inform investment decisions and guide future crowdfunding campaigns.

Overall, this dataset provides a comprehensive look at the Kickstarter ecosystem and can serve as a valuable resource for anyone interested in understanding the dynamics of crowdfunding and the factors that contribute to project success or failure.

### Assumptions About the Data

**ID** - can be dropped or transformed.

**Name** - the length of the name might be an indicator of how successfully a project is funded; can be dropped initially, since the raw text has no learning effect for the model.

**Category** - seems reasonable - keep.

**Subcategory** - every subcategory can be assigned to a category; maybe drop for the first analysis, or one-hot encode subcategory and category.

**Country** - self-explanatory - keep.

**Launched** - how long are projects in "funding"? Cut the timestamp and keep only the date - keep.

**Deadline** - maybe predict the maximum time it takes for a project to be successfully funded? - keep.

**Goal** - drop NaN/zero values (a goal of 0 is reached automatically); maybe cut outliers; threshold maybe around 100 or 1000 - keep.

**Pledged** - always successful when > goal; always "live" when the deadline has not passed and pledged < goal; always failed when pledged < goal and the deadline has passed; correlates with the state (target) column - keep.

**Backers** - maybe not that relevant; shows popularity - keep for now for further analysis (might be dropped later).

**State** - target - keep.
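
The assumption about **Pledged** and **State** can be checked directly once the data is loaded. The following is a minimal sketch on made-up toy rows that reuse the dataset's column names; the `goal_reached` column and the choice of `>=` (rather than strictly `>`) are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows (made up for illustration), using the dataset's column names.
toy = pd.DataFrame({
    "Goal":    [1000, 20, 500, 300],
    "Pledged": [625, 35, 700, 100],
    "State":   ["Failed", "Successful", "Canceled", "Live"],
})

# Hypothetical helper column: did the pledged amount reach the goal?
toy["goal_reached"] = toy["Pledged"] >= toy["Goal"]

# Cross-tabulate the flag against State to see where the assumption breaks.
check = pd.crosstab(toy["State"], toy["goal_reached"])
print(check)

# Rows that contradict "goal reached implies Successful".
violations = toy[toy["goal_reached"] & (toy["State"] != "Successful")]
print(violations)
```

On the real data, any non-empty `violations` frame would show which states (for example canceled projects) do not follow the rule.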



In [1]:
#load the data
import pandas as pd
data = pd.read_csv('data/kickstarter_projects.csv')

In [2]:
data.head()

Unnamed: 0,ID,Name,Category,Subcategory,Country,Launched,Deadline,Goal,Pledged,Backers,State
0,1860890148,Grace Jones Does Not Give A F$#% T-Shirt (limi...,Fashion,Fashion,United States,2009-04-21 21:02:48,2009-05-31,1000,625,30,Failed
1,709707365,CRYSTAL ANTLERS UNTITLED MOVIE,Film & Video,Shorts,United States,2009-04-23 00:07:53,2009-07-20,80000,22,3,Failed
2,1703704063,drawing for dollars,Art,Illustration,United States,2009-04-24 21:52:03,2009-05-03,20,35,3,Successful
3,727286,Offline Wikipedia iPhone app,Technology,Software,United States,2009-04-25 17:36:21,2009-07-14,99,145,25,Successful
4,1622952265,Pantshirts,Fashion,Fashion,United States,2009-04-27 14:10:39,2009-05-26,1900,387,10,Failed


In [3]:
#check for missing values

data.isnull().sum()

ID             0
Name           0
Category       0
Subcategory    0
Country        0
Launched       0
Deadline       0
Goal           0
Pledged        0
Backers        0
State          0
dtype: int64

In [4]:
#check data types
data.dtypes

ID              int64
Name           object
Category       object
Subcategory    object
Country        object
Launched       object
Deadline       object
Goal            int64
Pledged         int64
Backers         int64
State          object
dtype: object

In [5]:
#check unique values in "Category" column
data['Category'].unique()

array(['Fashion', 'Film & Video', 'Art', 'Technology', 'Journalism',
       'Publishing', 'Theater', 'Music', 'Photography', 'Games', 'Design',
       'Food', 'Crafts', 'Comics', 'Dance'], dtype=object)

In [6]:
#check unique values in "Subcategory" column
data['Subcategory'].unique()

array(['Fashion', 'Shorts', 'Illustration', 'Software', 'Journalism',
       'Fiction', 'Theater', 'Rock', 'Photography', 'Puzzles',
       'Graphic Design', 'Film & Video', 'Publishing', 'Documentary',
       'Sculpture', 'Electronic Music', 'Nonfiction', 'Food', 'Painting',
       'Indie Rock', 'Video Games', 'Public Art', 'Product Design', 'Art',
       "Children's Books", 'Crafts', 'Jazz', 'Music', 'Comics',
       'Narrative Film', 'Tabletop Games', 'Digital Art', 'Animation',
       'Conceptual Art', 'Pop', 'Hip-Hop', 'Country & Folk',
       'Periodicals', 'Webseries', 'Performance Art', 'Technology',
       'Art Books', 'World Music', 'Knitting', 'Classical Music',
       'Poetry', 'Graphic Novels', 'Radio & Podcasts', 'Design',
       'Hardware', 'Webcomics', 'Dance', 'Translations', 'Crochet',
       'Games', 'Photo', 'Mixed Media', 'Space Exploration', 'Photobooks',
       'Musical', 'Audio', 'Community Gardens', 'R&B',
       'Fabrication Tools', 'Textiles', 'Architecture',

In [7]:
#count unique values in "Subcategory" column
data['Subcategory'].nunique()

159

In [8]:
# range of the "Goal" column
data['Goal'].min(), data['Goal'].max()

(0, 166361391)
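
The range shows that goals of 0 exist and that the maximum is very large. A small sketch of how zero goals and a lower threshold could be inspected follows; the values are toy numbers, and `MIN_GOAL = 100` is only one of the candidate thresholds mentioned in the notes above, not a decision.

```python
import pandas as pd

# Toy goal values (illustrative only), including a zero and an extreme value.
goal = pd.Series([0, 20, 99, 1000, 1900, 80000, 166361391], name="Goal")

# How many projects have a goal of zero?
n_zero = int((goal == 0).sum())

# Quantiles give a feel for where outliers start.
quantiles = goal.quantile([0.5, 0.9, 0.99])

# Hypothetical lower threshold; the notes suggest something around 100 or 1000.
MIN_GOAL = 100
kept = goal[goal >= MIN_GOAL]

print(n_zero)
print(quantiles)
print(kept.tolist())
```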

## First Cleaning of the Dataset
Drop the ID column and the timestamp in "Launched".

In [9]:
#drop ID Column    
data = data.drop(columns=['ID'])

In [10]:
#drop the timestamp in "Launched" and parse "Deadline" as a date
data['Launched'] = pd.to_datetime(data['Launched']).dt.normalize()
data['Deadline'] = pd.to_datetime(data['Deadline']).dt.normalize()
data['Deadline']

0        2009-05-31
1        2009-07-20
2        2009-05-03
3        2009-07-14
4        2009-05-26
            ...    
374848   2018-01-16
374849   2018-02-09
374850   2018-01-16
374851   2018-02-01
374852   2018-01-26
Name: Deadline, Length: 374853, dtype: datetime64[ns]

## Modeling Baseline Dataset and Advanced Dataset

In [11]:
#create 2 different dataframes for a baseline model and an advanced model
baseline_data = data.copy()
advanced_data = data.copy()

In [12]:
# drop subcategory column for baseline model
baseline_data = baseline_data.drop(columns=['Subcategory'])
#drop backers in baseline model 
baseline_data = baseline_data.drop(columns=['Backers'])

In [13]:
baseline_data.head()

Unnamed: 0,Name,Category,Country,Launched,Deadline,Goal,Pledged,State
0,Grace Jones Does Not Give A F$#% T-Shirt (limi...,Fashion,United States,2009-04-21,2009-05-31,1000,625,Failed
1,CRYSTAL ANTLERS UNTITLED MOVIE,Film & Video,United States,2009-04-23,2009-07-20,80000,22,Failed
2,drawing for dollars,Art,United States,2009-04-24,2009-05-03,20,35,Successful
3,Offline Wikipedia iPhone app,Technology,United States,2009-04-25,2009-07-14,99,145,Successful
4,Pantshirts,Fashion,United States,2009-04-27,2009-05-26,1900,387,Failed


In [14]:
#count rows per class in the "State" column
baseline_data['State'].value_counts()

State
Failed        197611
Successful    133851
Canceled       38751
Live            2798
Suspended       1842
Name: count, dtype: int64

In [15]:
baseline_data.Country.value_counts()

Country
United States     292618
United Kingdom     33671
Canada             14756
Australia           7839
Germany             4171
France              2939
Italy               2878
Netherlands         2868
Spain               2276
Sweden              1757
Mexico              1752
New Zealand         1447
Denmark             1113
Ireland              811
Switzerland          760
Norway               708
Hong Kong            618
Belgium              617
Austria              597
Singapore            555
Luxembourg            62
Japan                 40
Name: count, dtype: int64

# EDA of Baseline Dataset

What to find out?

- Encode Category?
- Encode Country? Imbalance: roughly 78% of projects are from the USA
- Names?
- Project duration from Launched and Deadline - new column
- Goal attainment - new column from Goal and Pledged

---

- Goal of 0 = successful? -> cut
- Can canceled projects be cut?
- Goal threshold?
- Why are projects suspended?
- More successful in country x?
- Goal / State relation?
- Goal / Pledged relation?
- Top 10% of goals: failed or successful?
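
The two new columns from the list above (project duration and goal attainment) could be derived roughly as follows. This is a sketch on toy rows; the column names `duration_days` and `pledged_ratio` are hypothetical, and rows with a goal of 0 would need to be removed first to avoid division by zero.

```python
import pandas as pd

# Toy rows (illustrative), with dates already parsed as datetimes.
toy = pd.DataFrame({
    "Launched": pd.to_datetime(["2009-04-21", "2009-04-24"]),
    "Deadline": pd.to_datetime(["2009-05-31", "2009-05-03"]),
    "Goal":     [1000, 20],
    "Pledged":  [625, 35],
})

# Project duration in whole days between launch and deadline.
toy["duration_days"] = (toy["Deadline"] - toy["Launched"]).dt.days

# Goal attainment as the share of the goal that was pledged (1.0 = goal met).
toy["pledged_ratio"] = toy["Pledged"] / toy["Goal"]

print(toy[["duration_days", "pledged_ratio"]])
```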

In [16]:
#check for duplicates
baseline_data.duplicated().sum()

#show both duplicate rows
baseline_data[baseline_data.duplicated(keep=False)]

#drop duplicates
baseline_data = baseline_data.drop_duplicates()

In [17]:
#check if duplicates are dropped
baseline_data.duplicated().sum()

0

In [18]:
baseline_data["State"].value_counts(normalize=True)

State
Failed        0.527168
Successful    0.357079
Canceled      0.103375
Live          0.007464
Suspended     0.004914
Name: proportion, dtype: float64
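
About 88% of the rows are either Failed or Successful. If the team decides to model only finished projects, the target could be reduced to a binary label. A sketch under that assumption, with a hypothetical `target` column and toy data:

```python
import pandas as pd

# Toy State values (illustrative) covering all five classes in the dataset.
toy = pd.DataFrame({
    "State": ["Failed", "Successful", "Canceled", "Live", "Suspended", "Successful"],
})

# Keep only projects with a final outcome (assumption: the other states are dropped).
finished = toy[toy["State"].isin(["Failed", "Successful"])].copy()

# Binary target: 1 = Successful, 0 = Failed.
finished["target"] = (finished["State"] == "Successful").astype(int)

print(finished)
```

Whether Canceled rows should instead be treated as failures is still an open question in the list above.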

In [19]:
baseline_data["Country"].value_counts(normalize=True)

Country
United States     0.780619
United Kingdom    0.089825
Canada            0.039365
Australia         0.020912
Germany           0.011127
France            0.007840
Italy             0.007678
Netherlands       0.007651
Spain             0.006072
Sweden            0.004687
Mexico            0.004674
New Zealand       0.003860
Denmark           0.002969
Ireland           0.002164
Switzerland       0.002027
Norway            0.001889
Hong Kong         0.001649
Belgium           0.001646
Austria           0.001593
Singapore         0.001481
Luxembourg        0.000165
Japan             0.000107
Name: proportion, dtype: float64
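
Given this imbalance, one option before encoding is to merge rare countries into a single bucket and then compare success rates per group. The following sketch uses toy data; `MIN_SHARE`, the `"Other"` label and the `country_grouped` column are hypothetical choices, not something the notebook has settled on.

```python
import pandas as pd

# Toy data (illustrative): one dominant country and a few rare ones.
toy = pd.DataFrame({
    "Country": ["United States"] * 6 + ["United Kingdom"] * 2 + ["Japan", "Luxembourg"],
    "State": ["Failed", "Successful", "Failed", "Successful", "Failed", "Failed",
              "Successful", "Failed", "Failed", "Successful"],
})

# Hypothetical cut-off: countries below this share are merged into "Other".
MIN_SHARE = 0.15
share = toy["Country"].value_counts(normalize=True)
rare = share[share < MIN_SHARE].index
toy["country_grouped"] = toy["Country"].where(~toy["Country"].isin(rare), "Other")

# Share of successful projects per (grouped) country.
success_rate = (toy["State"] == "Successful").groupby(toy["country_grouped"]).mean()

# One-hot encode the grouped column for use in a model.
encoded = pd.get_dummies(toy["country_grouped"], prefix="country")

print(success_rate)
print(encoded.columns.tolist())
```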