# Hypothesis 4: Influence of title and description

1. We think that longer titles and descriptions perform better than shorter ones, because they make it clearer what the project is for
2. There is an optimal length for titles + descriptions
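
One way to look for an optimal length is to bin the lengths and compare the success rate per bin. The following is a minimal sketch on made-up data, not the notebook's result: the frame `toy_len` and the bin edges are assumptions chosen for illustration, and in this notebook the same steps could be applied to `final_data`.

```python
import pandas as pd

# Hypothetical toy data standing in for final_data
toy_len = pd.DataFrame({
    "name_length": [10, 12, 25, 28, 30, 45, 48, 50, 70, 75],
    "successful":  [0,  0,  0,  1,  1,  1,  1,  1,  0,  1],
})

# Cut the lengths into equal-width bins (the edges are an assumption)
toy_len["length_bin"] = pd.cut(toy_len["name_length"], bins=[0, 20, 40, 60, 80])

# Share of successful projects and number of projects in each bin
success_by_bin = (
    toy_len.groupby("length_bin", observed=True)["successful"]
           .agg(success_rate="mean", n_projects="size")
)
print(success_by_bin)
```

A success rate that rises and then falls across the bins would support the idea of an optimal length. Bins with few projects should be read with caution.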


## Set up + Load

In [None]:
# import the necessary libraries you need for your analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from matplotlib.ticker import PercentFormatter

# import sklearn items
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Import TfidfVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer,  CountVectorizer

RSEED = 42

# set general params
plt.rcParams.update({ "figure.figsize" : (10, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
# Floats (decimal numbers) should be displayed rounded with 1 decimal place
pd.set_option('display.float_format', lambda x: '%.1f' % x)
# Set style for plots
plt.style.use('fivethirtyeight') 

In [None]:
# Import data 
df = pd.read_csv('data/2_data.csv')

In [None]:
# Create frame with relevant items
# .copy() avoids a SettingWithCopyWarning when columns are modified below
df_hypo4 = df[['id','name','name_length','description','description_length','state','category_main']].copy()

In [None]:
df_hypo4['description_length'] = df_hypo4['description_length'].fillna(0)
df_hypo4['description'] = df_hypo4['description'].fillna('Misc')
df_hypo4.info()

In [None]:
# Get dummies: https://datagy.io/pandas-get-dummies/
# Apply get_dummies to the target STATE; drop_first=True drops the first level
dummies = pd.get_dummies(df['state'], drop_first=True)
dummies.head()

In [None]:
# Apply get_dummies to the added feature CATEGORY
category = pd.get_dummies(df['category_main'], drop_first=True)
category.head()

frames = [df_hypo4,category,dummies]
final_data = pd.concat(frames,axis=1)
final_data.head()

# EDA

In [None]:
final_data[['name_length','description_length']].describe()

In [None]:
final_data[['successful','name_length','description_length']].query('successful == 1').describe()

In [None]:
final_data[['successful','name_length','description_length']].query('successful == 0').describe()

In [None]:
print('Median of name length:', final_data[['name_length']].median(),'Median of description length:', final_data[['description_length']].median())

In [None]:
print('Mode of name length:', final_data[['name_length']].mode(),'Mode of description length:', final_data[['description_length']].mode())

In [None]:
his1 = sns.histplot(data=final_data,x='name_length',bins=10,hue='state')

In [None]:
dis1 = sns.displot(data=final_data,x='name_length',hue='state')
print('Mean for name_length is:', round(df['name_length'].mean(),2))
plt.show();

In [None]:
#print('Median:' final_data['description_length'])
his2 = sns.histplot(data=final_data,x='description_length',bins=10,hue='state')

In [None]:
dis2 = sns.displot(data=final_data,x='description_length',hue='state')
print('Mean for description_length is:', round(df['description_length'].mean(),2))
plt.show();

In [None]:
sns.boxplot(data=final_data[['name_length','description_length']])

In [None]:
sns.pairplot(final_data[['state','name_length','description_length']], hue="state", height=3);

In [None]:
correlations = final_data[['name_length','description_length','successful']].corr()
correlations

In [None]:
heat = sns.heatmap(correlations,annot=True)
plt.show();


In [None]:
# Average length of title (name)
nlmean = final_data['name_length'].mean()
print(nlmean)
order = final_data[['category_main','name_length']].groupby('category_main').mean().sort_values('name_length', ascending=False)

b = sns.barplot(data=final_data, x='category_main', y='name_length' , hue='state')
b.set_xticklabels(b.get_xticklabels(),rotation = 90, size = 10)
b.axhline(y=nlmean, color='black', linestyle ="--")
plt.legend(loc='upper right')
plt.ylim(0, 150)
plt.title("Average length of the project title by category")
plt.xlabel(" ")
plt.ylabel("Length of project's title")
plt.show()

In [None]:
# Average length of description
dlmean = final_data['description_length'].mean()
print(dlmean)

c = sns.barplot(data=final_data, x='category_main', y='description_length' , hue='state')
c.set_xticklabels(c.get_xticklabels(),rotation = 90, size = 10)
c.axhline(y=dlmean, color='black', linestyle ="--")
plt.legend(loc='upper right')
plt.ylim(0, 150)
plt.title("Average length of the project description by category")
plt.xlabel(" ")
plt.ylabel("Length of project's description")
plt.show()

# Hypothesis baseline model 

## Model 1: Calculate logistic regression with only length features

In [None]:
# Defining X and y
y = final_data['successful']
# Initial model logreg
X = final_data[['name_length','description_length']]


In [None]:
X.info()

In [None]:
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RSEED, stratify=y)
X_train.head()

In [None]:
# Scaling with MinMaxScaler (same fit/transform syntax as StandardScaler)
minmax = MinMaxScaler()
X_train_scaled = minmax.fit_transform(X_train)
X_test_scaled = minmax.transform(X_test)

X_train_scaled

It is good practice to choose an evaluation method before running machine learning models - not after. The weighted average F1 score was chosen. The F1 score calculates the harmonic mean between precision and recall, and is a suitable measure because there is no preference for false positives or false negatives in this case (both are equally bad). The weighted average will be used because the classes are of slightly different sizes, and we want to be able to predict both successes and failures.
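
To make the chosen metric concrete, here is a small sketch of how the weighted average F1 score relates to the per-class F1 scores. The labels `y_true` and `y_hat` are made up for illustration and are not the model's predictions.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = successful, 0 = not successful
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_hat  = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

# F1 score for each class, in the order given by `labels`
f1_per_class = f1_score(y_true, y_hat, average=None, labels=[0, 1])

# Weighted average: each class's F1 weighted by its share of true samples
f1_weighted = f1_score(y_true, y_hat, average="weighted")

print("F1 per class (0, 1):", f1_per_class)
print("Weighted F1:", round(f1_weighted, 3))
```

Because the weights are the class supports, a weak F1 score on the smaller class pulls the weighted average down less than it would pull down an unweighted (macro) average.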

In [None]:
# Instantiate a logistic regression model with default parameters
logreg = LogisticRegression()
# Fit
model1 = logreg.fit(X_train_scaled,y_train)
# Predict
y_pred = logreg.predict(X_test_scaled)

In [None]:
# model 1 
cr = classification_report(y_test, y_pred)

cm = confusion_matrix(y_test, y_pred)

print(cr)

In [None]:
# Model 1: Confusion matrix using confusion_matrix from sklearn
sns.heatmap(cm, cmap='YlGnBu', annot=True, fmt='d', linewidths=.5);

## Model 2: Length + Category

In [None]:
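# Follow-up model: lengths + category dummies (note: the 'id' column is not dropped and is therefore also used as a feature)
X2 = final_data.drop(['successful','name','description','state','category_main'], axis=1)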

In [None]:
# Splitting into train and test sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.3, random_state=RSEED, stratify=y)

In [None]:
# Scaling with MinMaxScaler
minmax = MinMaxScaler()
X_train_scaled2 = minmax.fit_transform(X_train2)
X_test_scaled2 = minmax.transform(X_test2)

X_train_scaled2.shape

In [None]:
# Instantiate a logistic regression model with default parameters
logreg = LogisticRegression()
# Fit
model2 = logreg.fit(X_train_scaled2,y_train2)
# Predict
y_pred2 = logreg.predict(X_test_scaled2)

In [None]:
# model 2
cr2 = classification_report(y_test2, y_pred2)

cm2 = confusion_matrix(y_test2, y_pred2)

print(cr2)

In [None]:
# Model 2: Confusion matrix using confusion_matrix from sklearn
sns.heatmap(cm2, cmap='YlGnBu', annot=True, fmt='d', linewidths=.5);

# Model 3: Naive Bayes for title and description

## Preprocessing and quality check

In [None]:
# Quality check for NaNs that might kill the model
final_data[['successful','name','description']].info()

In [None]:
# Rows with missing text cannot be vectorized, so drop them and assign the result to a new df (bayes)
bayes = final_data[['successful','name','description']].dropna()
# Alternative: fill missing text instead of dropping the rows
#bayes = final_data[['successful','name','description']].fillna("Misc")

# Conduct quality check
#bayes.info()


## Model 3a: Name of project

In [None]:
# Select the features X for Bayes (Xb) and the target (yb)
yb = bayes['successful']
Xb = bayes['name']

# Quality check for yb
yb.shape

In [None]:
# Quality check for Xb
Xb.shape

In [None]:
# Splitting into train and test sets with 30% test size; the sample is slightly unbalanced, so stratify
# Stratify on yb (not y): after dropna() only yb is guaranteed to match Xb in length
X_trainb, X_testb, y_trainb, y_testb = train_test_split(Xb, yb, test_size=0.3, random_state=RSEED, stratify=yb)

In [None]:
X_trainb.head()

In [None]:
# Create pipeline with TfidfVectorizer and multinomial naive Bayes
model_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Fit pipeline/model with training data
model_pipeline.fit(X_trainb, y_trainb)

In [None]:
# Predict data
y_predb = model_pipeline.predict(X_testb)


In [None]:
# Evaluate 
crb = classification_report(y_testb, y_predb)

cmb = confusion_matrix(y_testb, y_predb)

print(crb)

In [None]:
# Confusion matrix using confusion_matrix from sklearn
sns.heatmap(cmb, cmap='YlGnBu', annot=True, fmt='d', linewidths=.5);

## Model 3b: Description of project

In [None]:
# Select the features X for Bayes on the description (Xbd) and the target (ybd)
ybd = bayes['successful']
Xbd = bayes['description']

In [None]:
# Splitting into train and test sets with 30% test size; the sample is slightly unbalanced, so stratify
# Stratify on ybd (not y): after dropna() only ybd is guaranteed to match Xbd in length
X_trainbd, X_testbd, y_trainbd, y_testbd = train_test_split(Xbd, ybd, test_size=0.3, random_state=RSEED, stratify=ybd)

In [None]:
# Fit pipeline/model with training data (this re-fits the same pipeline on the descriptions)
model_pipeline.fit(X_trainbd, y_trainbd)

In [None]:
# Predict data
y_predbd = model_pipeline.predict(X_testbd)

In [None]:
# Evaluate 
crbd = classification_report(y_testbd, y_predbd)

cmbd = confusion_matrix(y_testbd, y_predbd)

print(crbd)

In [None]:
# Confusion matrix using confusion_matrix from sklearn
sns.heatmap(cmbd, cmap='YlGnBu', annot=True, fmt='d', linewidths=.5);

# Conclusion

The initial hypothesis was that the title and description hold predictive power for whether or not a project will be successful. 

As an initial check, a logistic regression was used. The performance metrics of the model that included the length of title and description were: 
- Accuracy: initial 0.58, after adding categories 0.64
- F1 Score overall: 0.54, after adding categories 0.63

The main driver of this result is the model's inability to predict the unsuccessful projects correctly. 
- Considering only the successful projects, we reach a satisfying F1 score of 0.72. 
- However, when looking at the unsuccessful projects, we only get an F1 score of 0.52.

Conclusion: Considering also the correlations, there seems to be a slight positive relationship between the length of the title and the outcome of the project. 
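
As a reading aid for the correlation table, the sketch below shows how a correlation between a length and the 0/1 outcome can be interpreted. With a binary target, the Pearson correlation is positive exactly when the successful group has the higher mean length. The frame `toy_corr` is made-up data for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data standing in for final_data
toy_corr = pd.DataFrame({
    "name_length": [10, 20, 30, 40, 50, 60],
    "successful":  [0,  0,  1,  0,  1,  1],
})

# Pearson correlation between the length and the binary outcome
r = toy_corr["name_length"].corr(toy_corr["successful"])

# Mean length per outcome group; the sign of r follows the difference of these means
mean_by_outcome = toy_corr.groupby("successful")["name_length"].mean()

print("Correlation:", round(r, 2))
print(mean_by_outcome)
```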



In [None]:
print(cr)

Adding the category improves precision, but it makes the recall worse. It can therefore be concluded that length needs additional factors to predict well. Removing the length of the description was also tried; this did not change the model greatly. It was therefore decided to run a naive Bayes model to see if this holds true. 

In [None]:
print(cr2)

**Naive Bayes (Multinomial)** was trained, and it indicates that both the title and the description have slight predictive power when combined with the category of the project. 

Considering only the title, the accuracy improves to 0.65, mainly driven by the increase in recall (0.85). The predictions for the negative class get slightly better, but are still not good (precision of 0.66).


In [None]:
print(crb)

Looking only at the description, it is surprising to learn that it also performs quite well (accuracy of 0.69) and slightly outperforms the title when it comes to the negative class (precision of 0.71 for the negatives). 

In conclusion, both features should be included in the final model, since each holds some predictive power for the success of a project. 
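
A possible way to use both text features in one model is sketched below. This is only an illustration under assumptions, not the final model: the frame `toy_text` is made-up data, and combining two `TfidfVectorizer` steps through scikit-learn's `ColumnTransformer` is one option among several.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the bayes frame
toy_text = pd.DataFrame({
    "name": ["solar lamp", "board game night", "indie film",
             "solar charger", "card game", "short film"],
    "description": ["a lamp powered by the sun", "a game for long evenings",
                    "a film about a small town", "charge your phone with the sun",
                    "a quick game with cards", "a film shot in one day"],
    "successful": [1, 0, 1, 1, 0, 0],
})

# One TF-IDF vocabulary per text column; the two matrices are stacked side by side
text_features = ColumnTransformer([
    ("name_tfidf", TfidfVectorizer(), "name"),
    ("description_tfidf", TfidfVectorizer(), "description"),
])

# Multinomial naive Bayes on the combined TF-IDF features
combined_model = make_pipeline(text_features, MultinomialNB())
combined_model.fit(toy_text[["name", "description"]], toy_text["successful"])

# Predictions on the training rows, only to show that the pipeline runs end to end
combined_pred = combined_model.predict(toy_text[["name", "description"]])
print(combined_pred)
```

On the real data, the same pipeline would be fit on the training split and evaluated on the test split with `classification_report`, as for the models above.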

In [None]:
print(crbd)