# KickStarter - NLP Sentiment Analysis and Word Usage or: Help Us to Film a New, Short, Documentary about a Tabletop Miniature Festival (which I am slightly unhappy about)

The dataset for this project contains the English blurb (description) of 215,513 Kickstarter projects from 2017: 108,310 successful and 107,203 failed. The data was collected by webrobots.io, who performed the web scraping. They cleaned and tidied the scraped data, keeping just two columns: the English-language blurb and a final state of "successful" or "failed". My analysis builds on this cleaned state.

> Kickstarter is an American public-benefit corporation based in  Brooklyn, New York, that maintains a global crowdfunding platform  focused on creativity and merchandising. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4  million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects. People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to subscription model of arts patronage, where artists would go directly to their audiences to fund their work.

<b>- Wikipedia</b>

The goal of this notebook is to perform sentiment and word usage analysis on successful and failed blurbs, to visualize the results, and to identify trends in blurb writing that may be useful for future Kickstarter blurb writers.

Data visualization is performed using Plotly.


## Import Data to Pandas DataFrame

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_ks=pd.read_csv('../input/kickstarter-nlp/df_text_eng.csv',index_col='Unnamed: 0')
df_ks.dropna(inplace=True)
df_ks.head()

## Sentiment Analysis 
NLTK's VADER sentiment analyzer [1] is used to score each blurb on Positive, Neutral and Negative sentiment, plus a Compound score for overall sentiment.


In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.polarity_scores(df_ks['blurb'].iloc[2])

Create a function to generate new dataframe features from sentiment analysis scores

In [None]:
def tag_conf_gen(df):
    neg=[]
    neu=[]
    pos=[]
    compound=[]
    for text in df['blurb']:
        result = sid.polarity_scores(text)
        neg.append(result['neg'])
        neu.append(result['neu'])
        pos.append(result['pos'])
        compound.append(result['compound'])
    df['neg']=neg
    df['neu']=neu
    df['pos']=pos
    df['compound']=compound
    return df

In [None]:
df_ks_sent=tag_conf_gen(df_ks)

In [None]:
df_ks_sent.head()

In [None]:
df_ks_sent.describe()

In [None]:
df_ks.loc[df_ks.state=='successful'].describe()

In [None]:
df_ks.loc[df_ks.state=='failed'].describe()

In [None]:
import plotly.graph_objects as go
labels=['Positive','Neutral','Negative']
values=[np.sum([df_ks_sent['compound']>0]),np.sum([df_ks_sent['compound']==0]),np.sum([df_ks_sent['compound']<0])]
fig = go.Figure(data=[go.Pie(labels=labels, values=values,marker_colors=['blue','yellow','red'])])
fig.update_layout(title_text="Sentiment of Kickstarter blurbs",)
fig.show()
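
The pie chart above buckets each blurb by the sign of its Compound score. As a minimal sketch of that rule, the helper below (`label_sentiment` is a hypothetical name, not part of the notebook or of NLTK) labels a score as Positive, Neutral or Negative. The `threshold` parameter is an assumption added for illustration: the default of 0 reproduces the notebook's sign-based split, while a small positive value would treat near-zero scores as Neutral.

```python
from collections import Counter

def label_sentiment(compound, threshold=0.0):
    """Label a compound score; threshold=0.0 mirrors the notebook's >0 / ==0 / <0 split."""
    if compound > threshold:
        return 'Positive'
    if compound < -threshold:
        return 'Negative'
    return 'Neutral'

# Made-up compound scores, for illustration only
sample_scores = [0.6, 0.0, -0.3, 0.03]
counts = Counter(label_sentiment(c) for c in sample_scores)
print(counts)
```

With a wider neutral band, for example `label_sentiment(0.03, threshold=0.05)`, the same score moves from Positive to Neutral, so the pie proportions depend on where the cut-off is placed.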

In [None]:
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

for n,s in enumerate(['successful','failed']):
    df=df_ks_sent.loc[df_ks_sent['state']==s]
    labels=['Positive','Neutral','Negative']
    values=[np.sum([df['compound']>0]),np.sum([df['compound']==0]),np.sum([df['compound']<0])]
    colors=['blue','yellow','red']
    fig.add_trace(go.Pie(labels=labels, values=values, name=s, marker_colors=colors),
              1, n+1)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent")

fig.update_layout(
    title_text="Sentiment of Successful and Failed Kickstarter blurbs",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Successful', x=0.16, y=0.5, font_size=12, showarrow=False),
                 dict(text='Failed', x=0.8, y=0.5, font_size=12, showarrow=False)])
fig.show()

In [None]:
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]])

labels=['Successful','Failed']
colors=['green','darkred']

# Name each trace after its sentiment group (previously reused a stale loop variable `s`)
df=df_ks_sent.loc[df_ks_sent['compound']>0]
values=[np.sum([df['state']=='successful']),np.sum([df['state']=='failed'])]
fig.add_trace(go.Pie(labels=labels, values=values, name='Positive', marker_colors=colors),
              1, 1)

df=df_ks_sent.loc[df_ks_sent['compound']==0]
values=[np.sum([df['state']=='successful']),np.sum([df['state']=='failed'])]
fig.add_trace(go.Pie(labels=labels, values=values, name='Neutral', marker_colors=colors),
              1, 2)

df=df_ks_sent.loc[df_ks_sent['compound']<0]
values=[np.sum([df['state']=='successful']),np.sum([df['state']=='failed'])]
fig.add_trace(go.Pie(labels=labels, values=values, name='Negative', marker_colors=colors),
              1, 3)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent")

fig.update_layout(
    title_text="Sentiment of Successful and Failed Kickstarter blurbs",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Positive', x=0.1, y=0.5, font_size=10, showarrow=False),
                 dict(text='Neutral', x=0.5, y=0.5, font_size=10, showarrow=False),
                 dict(text='Negative', x=0.9, y=0.5, font_size=10, showarrow=False)])
fig.show()

In [None]:
fig=go.Figure(data=[go.Histogram(x=df_ks_sent['compound'])])
fig.update_layout(title_text='Sentiment Histogram')
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=df_ks_sent.loc[df_ks_sent['state']=='successful']['compound'],nbinsx=50,name='successful'))
fig.add_trace(go.Histogram(x=df_ks_sent.loc[df_ks_sent['state']=='failed']['compound'],nbinsx=50,name='failed'))
# Overlay both histograms
fig.update_layout(barmode='overlay',title_text='Successful and Failed Sentiment Histograms')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.35)
fig.show()

In [None]:
import plotly.figure_factory as ff
data=[df_ks_sent.loc[df_ks_sent['state']=='successful']['compound'],df_ks_sent.loc[df_ks_sent['state']=='failed']['compound']]
labels=['successful','failed']
fig = ff.create_distplot(data, labels, bin_size=.1)
fig.update_layout(title_text='Successful and Failed Sentiment Distplots')
fig.show()

Statistical analysis of Compound scores for successful vs. failed blurbs, using Welch's t-test (unequal variances):<br>
Null Hypothesis: successful and failed blurbs have the same mean Compound Sentiment<br>
Alternate Hypothesis: successful and failed blurbs have different mean Compound Sentiment<br>
P-value threshold: 0.05

In [None]:
from scipy import stats
t_stat, p= stats.ttest_ind(df_ks_sent.loc[df_ks_sent['state']=='successful']['compound'],df_ks_sent.loc[df_ks_sent['state']=='failed']['compound'],equal_var=False)
print('T-Statistic: ',round(t_stat,2))
print('P-value: ',p)

The P-value is well below the 0.05 threshold, so the null hypothesis that the two groups have the same mean Compound Sentiment is rejected.<br>
The negative T-Statistic indicates that the mean Compound Sentiment of successful blurbs is **lower** than that of failed blurbs.
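
With roughly 100,000 blurbs in each group, even a very small difference in means can produce a tiny P-value, so it may help to look at an effect size alongside the t-test. The sketch below is not part of the original analysis: it computes Welch's t-statistic and Cohen's d using only the standard library, on made-up scores. The function names and the sample values are assumptions for illustration. Note that the sign of both statistics follows the argument order, so a negative value means the first group has the lower mean.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two samples (unequal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va / len(a) + vb / len(b))

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd

# Made-up compound scores, for illustration only
successful = [0.1, 0.2, 0.3, 0.4]
failed = [0.2, 0.3, 0.4, 0.5]

t_value = welch_t(successful, failed)
d_value = cohens_d(successful, failed)
print(round(t_value, 3), round(d_value, 3))
```

On the real data, the same functions could be called with the two `compound` columns converted to lists; a value of d close to zero would suggest the difference is statistically detectable but small in practical terms.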

In [None]:
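df_ks_sent['state_binary']=df_ks_sent['state']=='successful'
# Correlate only numeric/boolean columns; the text columns ('blurb', 'state') cannot be correlated
df_ks_sent.select_dtypes(include=['number','bool']).corr()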

In [None]:
# Compute the correlation matrix once, on numeric/boolean columns only
corr=df_ks_sent.select_dtypes(include=['number','bool']).corr()
fig=go.Figure(data=go.Heatmap(z=corr.values,
                             x=corr.columns,
                             y=corr.index,
                             xgap=5,
                             ygap=5,
                             # RGB channels are capped at 255
                             colorscale=[[0.0, "rgb(255,100,100)"],
                [0.4, "lightpink"],
                [0.45, "white"],
                [0.5,"lightblue"],
                [1.0, "rgb(100,100,255)"]]))
fig.update_layout(title_text='Sentiment Correlation Heatmap')
fig.show()

The correlation matrix indicates a very slight negative correlation between compound sentiment and blurb success.

## Word Usage Analysis

Blurbs are tokenized using NLTK's regular-expression tokenizer. Stopwords, punctuation and numbers are all removed. Word frequency is then determined and compared for successful and failed blurbs.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk import FreqDist
import string

In [None]:
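# Match runs of letters, optionally followed by an apostrophe suffix (e.g. "we're")
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
nltk.download('stopwords')
stopwords_list = stopwords.words('english')
# Add each punctuation character separately (not the whole string as one entry)
stopwords_list += list(string.punctuation)
stopwords_list += ['0','1','2','3','4','5','6','7','8','9']
# A set makes the membership tests below much faster on large token lists
stopwords_list = set(stopwords_list)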


Determine Word Frequency Distribution for successful and failed blurbs

In [None]:
succ_blurbs=df_ks_sent.loc[df_ks_sent['state']=='successful']['blurb'].values
succ_flat=' '.join(succ_blurbs)

succ_words= nltk.regexp_tokenize(succ_flat,pattern)
succ_tokens = [word.lower() for word in succ_words]


succ_tokens_stopped = [word for word in succ_tokens if word not in stopwords_list]

succ_freqdist = FreqDist(succ_tokens_stopped)
succ_freqdist.most_common(50)

In [None]:
fail_blurbs=df_ks_sent.loc[df_ks_sent['state']=='failed']['blurb'].values
fail_flat=' '.join(fail_blurbs)

fail_words= nltk.regexp_tokenize(fail_flat,pattern)
fail_tokens = [word.lower() for word in fail_words]

fail_tokens_stopped = [word for word in fail_tokens if word not in stopwords_list]

fail_freqdist = FreqDist(fail_tokens_stopped)
fail_freqdist.most_common(50)

Word Frequency DataFrame is created with features 'diff' and 'ratio' for (success-failure) and (success/failure) usage measures

In [None]:
succall_df=pd.DataFrame.from_dict(dict(succ_freqdist),orient='index',columns=['freq_success'])
failall_df=pd.DataFrame.from_dict(dict(fail_freqdist),orient='index',columns=['freq_fail'])
df_freqdist_all=succall_df.join(failall_df,how='left').fillna(0)
df_freqdist_all['diff']=df_freqdist_all['freq_success']-df_freqdist_all['freq_fail']
df_freqdist_all['ratio']=df_freqdist_all['freq_success']/df_freqdist_all['freq_fail']
df_freqdist_all['freq_fail']=np.negative(df_freqdist_all['freq_fail'])
df_freqdist_all.head()
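
The 'diff' and 'ratio' measures can be illustrated on a toy example. The sketch below uses `collections.Counter` instead of the notebook's DataFrame, and the function name `usage_table`, the `smoothing` parameter and the token lists are all assumptions for illustration. Two points differ from the notebook on purpose: words that appear only in failed blurbs are kept (the left join above drops them), and the ratio adds a small constant to both counts so that a word never used in failed blurbs does not cause a division by zero.

```python
from collections import Counter

def usage_table(succ_tokens, fail_tokens, smoothing=1.0):
    """Per-word counts, difference and smoothed ratio (success over failure)."""
    succ = Counter(succ_tokens)
    fail = Counter(fail_tokens)
    table = {}
    for word in set(succ) | set(fail):
        s, f = succ[word], fail[word]  # Counter returns 0 for missing words
        table[word] = {
            'freq_success': s,
            'freq_fail': f,
            'diff': s - f,
            'ratio': (s + smoothing) / (f + smoothing),
        }
    return table

# Made-up tokens, for illustration only
table = usage_table(['film', 'film', 'new', 'dice'], ['film', 'app', 'app'])
for word, row in sorted(table.items(), key=lambda kv: kv[1]['diff'], reverse=True):
    print(word, row)
```

Because the two groups in this dataset are close in size, raw counts are roughly comparable; with unbalanced groups, dividing each count by the group's total token count before comparing would be a safer choice.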

In [None]:
df_freq_succ20=df_freqdist_all.sort_values(by='freq_success',ascending=False).head(20)
df_freq_succ20

In [None]:
fig=go.Figure()
fig.add_trace(go.Bar(x=df_freq_succ20['freq_success'],y=df_freq_succ20.index,orientation='h',marker_color='blue',name='Usage - successful'))
fig.add_trace(go.Bar(x=df_freq_succ20['freq_fail'],y=df_freq_succ20.index,orientation='h',marker_color='red',name='Usage - failed'))
fig.add_trace(go.Bar(x=df_freq_succ20['diff'],y=df_freq_succ20.index,orientation='h',marker_color='green',name='Usage - net difference'))
fig.update_traces(opacity=.6)
fig.update_layout(barmode='overlay',yaxis=dict(autorange="reversed"),title_text='Most Commonly Used Words In Successful Blurbs Totals and Difference')
fig.show()

The most commonly used words in successful blurbs overlap heavily with the most commonly used words in failed blurbs.

Some words, however, are used much more frequently in successful blurbs than in failed blurbs. Two measures of this are shown below: the words with the highest net difference in usage (successful minus failed), and the words with the highest ratio of usage (successful over failed, with a minimum of 500 usages in successful blurbs).

In [None]:
df_freq_diff20=df_freqdist_all.sort_values(by='diff',ascending=False).head(20)
df_freq_diff20

In [None]:
fig=go.Figure()
fig.add_trace(go.Bar(x=df_freq_diff20['freq_success'],y=df_freq_diff20.index,orientation='h',marker_color='blue',name='Usage - successful'))
fig.add_trace(go.Bar(x=df_freq_diff20['freq_fail'],y=df_freq_diff20.index,orientation='h',marker_color='red',name='Usage - failed'))
fig.add_trace(go.Bar(x=df_freq_diff20['diff'],y=df_freq_diff20.index,orientation='h',marker_color='green',name='Usage - net difference'))
fig.update_traces(opacity=.6)
fig.update_layout(barmode='overlay',yaxis=dict(autorange="reversed"),title_text='Words With Largest Usage Difference Between Successful and Failed Blurbs')
fig.show()

In [None]:
df_freq_ratio20=df_freqdist_all.loc[df_freqdist_all['freq_success']>500].sort_values(by='ratio',ascending=False).head(20)
df_freq_ratio20

In [None]:
fig=go.Figure()
fig.add_trace(go.Bar(x=df_freq_ratio20['freq_success'],y=df_freq_ratio20.index,orientation='h',marker_color='blue',name='Usage - successful'))
fig.add_trace(go.Bar(x=df_freq_ratio20['freq_fail'],y=df_freq_ratio20.index,orientation='h',marker_color='red',name='Usage - failed'))
fig.add_trace(go.Bar(x=df_freq_ratio20['diff'],y=df_freq_ratio20.index,orientation='h',marker_color='green',name='Usage - net difference'))
fig.update_traces(opacity=.6)
fig.update_layout(barmode='overlay',yaxis=dict(autorange="reversed"),title_text='Words With Largest Usage Ratio Between Successful and Failed Blurbs (Minimum 500 Usages)')
fig.show()

Notes on word meanings:<br>
"th" is most likely a leftover from removing the numbers, e.g. "4th" becomes just "th". From the above charts, this would indicate successive campaigns, i.e. a project that merits a 4th campaign has 3 prior successful campaigns and is therefore likely to have returning backers.<br>
"mm" may relate to miniatures and tabletop games, which have very high success-to-failure differences and ratios.

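The "th" and "mm" explanation can be checked on a single made-up blurb. This sketch applies the same letter-only pattern with the standard library's `re.findall` rather than NLTK, on the assumption that it yields the same tokens here; the `tokenize` helper and the sample text are hypothetical.

```python
import re

# Same letter-only pattern as the notebook, written without the capture group
PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"

def tokenize(text):
    """Lower-cased alphabetic tokens; digits and punctuation are dropped."""
    return [token.lower() for token in re.findall(PATTERN, text)]

tokens = tokenize("We're back for our 4th campaign: 28mm miniatures!")
print(tokens)
# → ["we're", 'back', 'for', 'our', 'th', 'campaign', 'mm', 'miniatures']
```

The digits in "4th" and "28mm" are discarded while the trailing letters survive as separate tokens, which is consistent with the reading given in the notes above.
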
## Conclusions

The sentiment of Kickstarter blurbs is mostly positive; however, according to the t-test, successful blurbs have a slightly lower mean Compound sentiment than failed blurbs, so more positive wording does not appear to help.<br>
Interestingly, things that are "new" and "first" have high positive success differences, and so do follow-ups to previous Kickstarters, i.e. things that merit a "th". Being new and being a repeat are both positive attributes.<br>
"Tabletop", "miniature" and "dice" have very high success rates, as do "film" (especially "documentary") and "dance" projects.<br>
"Us" and "We're" both have significant positive differences and ratios, suggesting that backers favour group efforts.<br>
Phrasing does appear to matter somewhat: "music" projects are mostly unsuccessful but "albums" are mostly successful.<br>

Asking to <b>"help us film a new, short, documentary about a tabletop miniature festival"</b> is a sure-fire hit! 

[1] Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.