![logo](1_bDwEvCRgrKVbLrAXEixpfA.png)
___

##### import libraries

In [7]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
#from scipy import stats
%matplotlib inline 
sns.set(color_codes=True)

#natural language processing
#pip install nltk
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vlad_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now that we have seen the results of our initial classification models, we can draw a few conclusions. First, the models use words from the entire Kickstarter dataset to predict the 'state' of a campaign. This is useful, but it would make more sense to fit these classifiers to the separate genres of the Kickstarter dataset. The poor performance of our previous classification algorithms also supports this.

By splitting the data and looking only at the top 5 genres of the dataset, I hope to improve the performance of the models and to reach a more effective and insightful conclusion that future Kickstarter entrepreneurs can use. For example, someone looking to start a movie campaign on Kickstarter who wants to see which descriptive words perform well on the website won't find the previous classification models useful. Classification that uses every word in the entire corpus to predict the campaign state is too general and saturated to provide real-world implications for an actual Kickstarter campaign. Therefore, the following sections explore a more genre-specific set of data.

# Step 5 - Filtering Data & NLP
    a) Importing data
    
    b) Filtering
        -separate the top-5 genres of cleaned2_df into their own dataframes
    
    c) Create Corpora
        -use previous NLP techniques to build separate corpora for each new dataframe
    
    d) Save new data to .csv

### a) Importing data

In [8]:
#import cleaned2_data.csv file
cleaned2_df = pd.read_csv("cleaned2_data.csv")

cleaned2_df.drop(['Unnamed: 0'], axis=1, inplace=True)

print(cleaned2_df.shape)
print(cleaned2_df['subgenre'].value_counts())

(114898, 22)
music          18849
film           15591
publishing     12487
art            12317
food            9494
technology      8771
fashion         5684
comics          5534
games           5180
theater         3940
photography     3873
crafts          3763
design          3415
dance           3248
journalism      2752
Name: subgenre, dtype: int64


### b) Filter Data

In [9]:
#filter top-5 subgenres into their own dataframes

music_df = cleaned2_df.loc[cleaned2_df['subgenre'] == 'music']
music_df.reset_index(drop=True,inplace=True)

film_df = cleaned2_df.loc[cleaned2_df['subgenre'] == 'film']
film_df.reset_index(drop=True,inplace=True)

publishing_df = cleaned2_df.loc[cleaned2_df['subgenre'] == 'publishing']
publishing_df.reset_index(drop=True,inplace=True)

art_df = cleaned2_df.loc[cleaned2_df['subgenre'] == 'art']
art_df.reset_index(drop=True,inplace=True)

food_df = cleaned2_df.loc[cleaned2_df['subgenre'] == 'food']
food_df.reset_index(drop=True,inplace=True)
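
The five filters above hard-code the genre names taken from the `value_counts()` output. As a minimal sketch of an alternative, the top genres could be derived from the counts themselves. The helper `split_top_genres` and the toy dataframe are hypothetical names used for illustration only; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for cleaned2_df; only the 'subgenre' column matters here.
toy_df = pd.DataFrame({
    'subgenre': ['music', 'music', 'music', 'film', 'film', 'art'],
    'blurb': ['a', 'b', 'c', 'd', 'e', 'f'],
})

def split_top_genres(df, n, column='subgenre'):
    """Return {genre: dataframe} for the n most frequent values in `column`."""
    # value_counts() sorts from most to least frequent, so head(n) gives the top n
    top_genres = df[column].value_counts().head(n).index
    return {
        genre: df.loc[df[column] == genre].reset_index(drop=True)
        for genre in top_genres
    }

genre_dfs = split_top_genres(toy_df, n=2)
print({genre: len(frame) for genre, frame in genre_dfs.items()})
```

With the real data, `split_top_genres(cleaned2_df, n=5)` would presumably return the same five dataframes as the cell above, keyed by genre name.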

### c) Create Corpora

##### music_df

In [10]:
#build the stop-word set and the lemmatizer once, outside the loop
#(WordNetLemmatizer requires the 'wordnet' NLTK data package)
stop_words = set(stopwords.words('english'))
wn = WordNetLemmatizer()

music_corpus = []
for raw_blurb in music_df['blurb']:
    #keep only letters; replace every other character with a space
    blurb = re.sub('[^a-zA-Z]', ' ', raw_blurb)
    #change letters to lower-case and split into words
    blurb = blurb.lower().split()
    #remove stop-words and lemmatize the remaining words
    blurb = [wn.lemmatize(word) for word in blurb if word not in stop_words]
    #join the list of words back into a single string
    blurb = ' '.join(blurb)
    music_corpus.append(blurb)
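
The same cleaning loop is repeated for each of the five dataframes in this step. As a rough sketch, it could be wrapped in one function and reused. The function name `build_corpus` and its parameters are assumptions made for illustration. The demo uses a tiny hand-written stop-word set and an identity "lemmatizer" so that it runs without any NLTK data.

```python
import re

def build_corpus(blurbs, stop_words, lemmatize):
    """Clean each blurb: letters only, lower-case, stop-words removed, lemmatized."""
    corpus = []
    for raw in blurbs:
        # keep only letters, lower-case the text and split it into words
        words = re.sub('[^a-zA-Z]', ' ', raw).lower().split()
        # drop stop-words and lemmatize what is left
        words = [lemmatize(word) for word in words if word not in stop_words]
        corpus.append(' '.join(words))
    return corpus

# Stand-ins for illustration only (not the NLTK stop-word list or lemmatizer)
demo_stop_words = {'a', 'of', 'the', 'to'}

def demo_lemmatize(word):
    return word  # identity: leaves every word unchanged

demo_corpus = build_corpus(
    ['A 20 track collection of NEW music!', 'Help the band to record.'],
    demo_stop_words,
    demo_lemmatize,
)
print(demo_corpus)
```

In the notebook itself, the call would look something like `build_corpus(music_df['blurb'], set(stopwords.words('english')), WordNetLemmatizer().lemmatize)`.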

In [11]:
#write into a dataframe
music_corpus_df = pd.DataFrame(music_corpus, columns = ['blurb'])
music_corpus_df.head()

Unnamed: 0,blurb
0,track collection new original musical piece jo...
1,family eric garner need help raising money cre...
2,washington dc rock n roll band need help press...
3,four year since released solo record last card...
4,denver rock band hate seek fund release second...


##### film_df

In [12]:
#build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
wn = WordNetLemmatizer()

film_corpus = []
for raw_blurb in film_df['blurb']:
    #keep only letters; replace every other character with a space
    blurb = re.sub('[^a-zA-Z]', ' ', raw_blurb)
    #change letters to lower-case and split into words
    blurb = blurb.lower().split()
    #remove stop-words and lemmatize the remaining words
    blurb = [wn.lemmatize(word) for word in blurb if word not in stop_words]
    #join the list of words back into a single string
    blurb = ' '.join(blurb)
    film_corpus.append(blurb)

In [13]:
#write into a dataframe
film_corpus_df = pd.DataFrame(film_corpus, columns = ['blurb'])
film_corpus_df.head()

Unnamed: 0,blurb
0,genesis follows mother son live together forei...
1,losing fellow marine overseas sgt john casey d...
2,l short fan film follows spartan linda event h...
3,four friend embark road trip together hope ove...
4,juxtaposing delicious food good friend unsettl...


##### publishing_df

In [16]:
#build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
wn = WordNetLemmatizer()

publishing_corpus = []
for raw_blurb in publishing_df['blurb']:
    #keep only letters; replace every other character with a space
    blurb = re.sub('[^a-zA-Z]', ' ', raw_blurb)
    #change letters to lower-case and split into words
    blurb = blurb.lower().split()
    #remove stop-words and lemmatize the remaining words
    blurb = [wn.lemmatize(word) for word in blurb if word not in stop_words]
    #join the list of words back into a single string
    blurb = ' '.join(blurb)
    publishing_corpus.append(blurb)

In [17]:
#write into a dataframe
publishing_corpus_df = pd.DataFrame(publishing_corpus, columns = ['blurb'])
publishing_corpus_df.head()

Unnamed: 0,blurb
0,thema literary society proposes publish one th...
1,collection hand lettered thing say idea sentim...
2,monster skulking dark king bent destroying lig...
3,spark child interest write draw anytime anywhe...
4,wild word literary magazine set free ad fee


##### art_df

In [14]:
#build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
wn = WordNetLemmatizer()

art_corpus = []
for raw_blurb in art_df['blurb']:
    #keep only letters; replace every other character with a space
    blurb = re.sub('[^a-zA-Z]', ' ', raw_blurb)
    #change letters to lower-case and split into words
    blurb = blurb.lower().split()
    #remove stop-words and lemmatize the remaining words
    blurb = [wn.lemmatize(word) for word in blurb if word not in stop_words]
    #join the list of words back into a single string
    blurb = ' '.join(blurb)
    art_corpus.append(blurb)

In [15]:
#write into a dataframe
art_corpus_df = pd.DataFrame(art_corpus, columns = ['blurb'])
art_corpus_df.head()

Unnamed: 0,blurb
0,th anniversary year fsf revolutionary abstract...
1,let paint tv going search creativity space mis...
2,one unique space scene painted ink metallic se...
3,mosaic mural westside avenue adult center vent...
4,project make animal mask artistic skill develo...


##### food_df

In [18]:
#build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
wn = WordNetLemmatizer()

food_corpus = []
for raw_blurb in food_df['blurb']:
    #keep only letters; replace every other character with a space
    blurb = re.sub('[^a-zA-Z]', ' ', raw_blurb)
    #change letters to lower-case and split into words
    blurb = blurb.lower().split()
    #remove stop-words and lemmatize the remaining words
    blurb = [wn.lemmatize(word) for word in blurb if word not in stop_words]
    #join the list of words back into a single string
    blurb = ' '.join(blurb)
    food_corpus.append(blurb)

In [19]:
#write into a dataframe
food_corpus_df = pd.DataFrame(food_corpus, columns = ['blurb'])
food_corpus_df.head()

Unnamed: 0,blurb
0,drink fine wine le work
1,help make stockton next beer destination suppo...
2,crafting spectacular wine using grape many und...
3,racine brewing company newest nano brewery rac...
4,raising honey bee leasing farmer crop pollinat...


### d) Save new data to .csv

In [21]:
#append music_corpus blurb to music_df
music_df['blurb_corpus'] = music_corpus_df['blurb']
print(music_df.shape)
music_df.head(1)

(18849, 23)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,name,genre,subgenre,category,source_url,blurb,slug,goal,converted_pledged_amount,...,country,currency,backers_count,disable_communication,is_starrable,spotlight,staff_pick,state,success_percentage,blurb_corpus
0,177511186,i am worthy,R&B,music,"{""id"":322,""name"":""R&B"",""slug"":""music/r&b"",""pos...",https://www.kickstarter.com/discover/categorie...,a 20 track collection of new original musical ...,i-am-worthy-0,25000,19213,...,US,USD,108,0,0,0,0,failed,76.852,track collection new original musical piece jo...
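
The message above is pandas' `SettingWithCopyWarning`. `music_df` was created by filtering `cleaned2_df`, so pandas cannot tell whether the new column is meant for the filtered slice or for the original dataframe. The column is still added here (the shape shows 23 columns). One common way to avoid the warning, sketched below on a hypothetical toy dataframe, is to take an explicit `.copy()` when filtering.

```python
import pandas as pd

# Hypothetical stand-in for cleaned2_df
toy_df = pd.DataFrame({
    'subgenre': ['music', 'film', 'music'],
    'blurb': ['x', 'y', 'z'],
})

# .copy() makes the filtered dataframe independent of toy_df
music_toy_df = toy_df.loc[toy_df['subgenre'] == 'music'].copy()
music_toy_df.reset_index(drop=True, inplace=True)

# adding a column now modifies only the copy
music_toy_df['blurb_corpus'] = ['x clean', 'z clean']

print(music_toy_df.shape)
print(list(toy_df.columns))
```

The original dataframe keeps its two columns, and the new column exists only on the filtered copy.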


In [22]:
#append film_corpus blurb to film_df
film_df['blurb_corpus'] = film_corpus_df['blurb']
print(film_df.shape)
film_df.head(1)

(15591, 23)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,name,genre,subgenre,category,source_url,blurb,slug,goal,converted_pledged_amount,...,country,currency,backers_count,disable_communication,is_starrable,spotlight,staff_pick,state,success_percentage,blurb_corpus
0,69573339,Genesis,Science Fiction,film,"{""id"":301,""name"":""Science Fiction"",""slug"":""fil...",https://www.kickstarter.com/discover/categorie...,Genesis follows a mother and son that live tog...,genesis-5,7000,7415,...,US,USD,72,0,0,1,0,successful,105.928571,genesis follows mother son live together forei...


In [23]:
#append art_corpus blurb to art_df
art_df['blurb_corpus'] = art_corpus_df['blurb']
print(art_df.shape)
art_df.head(1)

(12317, 23)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,name,genre,subgenre,category,source_url,blurb,slug,goal,converted_pledged_amount,...,country,currency,backers_count,disable_communication,is_starrable,spotlight,staff_pick,state,success_percentage,blurb_corpus
0,1175859175,5th Anniversary Year / Five Small Fires,Conceptual Art,art,"{""id"":20,""name"":""Conceptual Art"",""slug"":""art/c...",https://www.kickstarter.com/discover/categorie...,5th Anniversary Year | FSF is about revolution...,5th-anniversary-year-five-small-fires,5000,5271,...,US,USD,53,0,0,1,0,successful,105.42,th anniversary year fsf revolutionary abstract...


In [24]:
#append publishing_corpus blurb to publishing_df
publishing_df['blurb_corpus'] = publishing_corpus_df['blurb']
print(publishing_df.shape)
publishing_df.head(1)

(12487, 23)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,name,genre,subgenre,category,source_url,blurb,slug,goal,converted_pledged_amount,...,country,currency,backers_count,disable_communication,is_starrable,spotlight,staff_pick,state,success_percentage,blurb_corpus
0,814040741,"THEMA issue ""One Thing Done Superbly""",Periodicals,publishing,"{""id"":49,""name"":""Periodicals"",""slug"":""publishi...",https://www.kickstarter.com/discover/categorie...,THEMA Literary Society proposes to publish One...,thema-issue-one-thing-done-superbly,4000,891,...,US,USD,18,0,0,0,0,failed,22.275,thema literary society proposes publish one th...


In [25]:
#append food_corpus blurb to food_df
food_df['blurb_corpus'] = food_corpus_df['blurb']
print(food_df.shape)
food_df.head(1)

(9494, 23)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,id,name,genre,subgenre,category,source_url,blurb,slug,goal,converted_pledged_amount,...,country,currency,backers_count,disable_communication,is_starrable,spotlight,staff_pick,state,success_percentage,blurb_corpus
0,1320153611,End of the Vine: Wine Arbitrage,Drinks,food,"{""id"":307,""name"":""Drinks"",""slug"":""food/drinks""...",https://www.kickstarter.com/discover/categorie...,Drink fine wine for less - we'll do the work.,end-of-the-vine-wine-arbitrage,25000,61370,...,US,USD,35,0,0,1,0,successful,245.48,drink fine wine le work


In [26]:
#write music_df to a .csv
music_df.to_csv('music_data.csv')

In [27]:
#write film_df to a .csv
film_df.to_csv('film_data.csv')

In [28]:
#write art_df to a .csv
art_df.to_csv('art_data.csv')

In [29]:
#write publishing_df to a .csv
publishing_df.to_csv('publishing_data.csv')

In [30]:
#write food_df to a .csv
food_df.to_csv('food_data.csv')
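
By default, `to_csv` also writes the row index, and that index comes back as an `Unnamed: 0` column when the file is read again. This is the column dropped right after loading `cleaned2_data.csv` in part a). A small sketch, using an in-memory buffer and a hypothetical toy dataframe instead of real files, shows how `index=False` avoids that extra column:

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the genre dataframes
toy_df = pd.DataFrame({'blurb': ['x', 'y'], 'blurb_corpus': ['x', 'y']})

# default behaviour: the index is written as an unnamed first column
default_buf = io.StringIO()
toy_df.to_csv(default_buf)
default_buf.seek(0)
with_index = pd.read_csv(default_buf)

# index=False: only the data columns are written
no_index_buf = io.StringIO()
toy_df.to_csv(no_index_buf, index=False)
no_index_buf.seek(0)
without_index = pd.read_csv(no_index_buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the files written above are saved with `index=False`, the later `drop(['Unnamed: 0'], ...)` step is no longer needed when loading them.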

### End of Step 5