# Review of Project 2: Kickstarter Success

## 1) Business Understanding

* Stakeholders: Entrepreneurs planning to run a Kickstarter project
* Key Assignment: Help stakeholders predict whether a project will be successful before they launch and invest too many resources
* Future Work: Predict a realistic pledge goal

## 2) Data Mining

* Dataset was provided as 50 csv files
* Load and concatenate into one dataframe
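
The load-and-concatenate step can be sketched as follows. This is a minimal example, assuming all files live in one directory and share the same columns; the temporary directory, the file names, and the toy columns are made up for illustration and are not the project's actual files.

```python
import glob
import os
import tempfile

import pandas as pd

# Create a few toy CSV files so the sketch runs on its own
# (stand-ins for the 50 files of the real dataset).
data_dir = tempfile.mkdtemp()
for i in range(3):
    toy = pd.DataFrame({"id": [i * 2, i * 2 + 1], "state": ["successful", "failed"]})
    toy.to_csv(os.path.join(data_dir, f"part_{i:03d}.csv"), index=False)

# Collect all CSV paths in a stable order, read each file, and stack the frames.
paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
frames = [pd.read_csv(path) for path in paths]
df = pd.concat(frames, ignore_index=True)  # ignore_index gives one clean 0..n-1 index

print(df.shape)
```

Sorting the paths keeps the row order reproducible between runs.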

## 3) Data Cleaning

* Project required extensive data cleaning:
    * various features with missing values
    * various unbalanced features 
* Multiple columns stored their features as a string representation of a dictionary
* Extracting the data from these columns was time-consuming
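
Extracting values from a column that holds dictionaries as strings could look roughly like this. The column name `category` and the keys `name` and `slug` are assumptions for illustration. `ast.literal_eval` only works if the strings are valid Python literals; if they are valid JSON instead, `json.loads` would be the alternative.

```python
import ast

import pandas as pd

# Toy column holding dicts as strings (hypothetical field names).
df = pd.DataFrame({
    "category": [
        "{'id': 1, 'name': 'Art', 'slug': 'art'}",
        "{'id': 7, 'name': 'Games', 'slug': 'games/tabletop'}",
    ]
})

# ast.literal_eval parses Python literals without executing arbitrary code.
parsed = df["category"].apply(ast.literal_eval)

# Pull the keys of interest out into their own columns.
df["category_name"] = parsed.apply(lambda d: d["name"])
df["category_slug"] = parsed.apply(lambda d: d["slug"])

print(df[["category_name", "category_slug"]])
```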


## 4) Data Exploration

* Standard descriptive statistics on features
* Visuals: mainly bar plots
* Histograms, comparison with normal distributions

## 5) Feature Engineering

* different ideas were discussed
* many were beyond the scope of the project
* 'base pledge', i.e. the pledged amount divided by the number of backers, turned out to be a very important feature
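
A minimal sketch of the 'base pledge' feature. The column names `pledged` and `backers_count` are assumptions, and so is the choice to set the feature to 0 for projects without backers.

```python
import numpy as np
import pandas as pd

# Toy data; column names are assumed for illustration.
df = pd.DataFrame({"pledged": [1000.0, 250.0, 0.0], "backers_count": [20, 5, 0]})

# Replace zero backers with NaN so the division does not produce inf,
# then fill the resulting gaps with 0 (an assumption, not the only option).
backers = df["backers_count"].replace(0, np.nan)
df["base_pledge"] = (df["pledged"] / backers).fillna(0.0)

print(df)
```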

## 6) Predictive Modeling

* 4 different models were used
* Logistic Regression: very simple, but poor results
* Decision Trees: slightly improved results
* Random Forests: good results, easy to implement, cheap
* xgboost: computationally expensive, results only marginally better than RF with 

## 7) Data Visualization

* A lot of insight was gained by looking at success rates of KS projects for different categories
* Heatmaps were helpful in understanding correlations of features with target variables and multi-collinearities

<a id='BU'></a>
# I Business Understanding

**A) steps**

a. Business setting

b. Problem statement

c. Business values

d. Knowing the time frame

e. Resources: time, people, finances, energy/prioritization, hw/sw equipment

f. Metric of success, e.g. focus on low false positives (precision) or low false negatives (recall)

**B) hows**

1. Ask biz stakeholders

2. Own research google/scholar

3. company website

4. Competitor website(s)

5. Company intranet / wiki

6. arxiv

7. What other data scientists have done in same or similar domain

# II Data Mining

* CSV, JSON
* SQL
* PySpark
* Web scraping

## III Data Cleaning (after initial look at data)

initial look

a) understanding the features (column names / predictor variables)
    -> figuring out what the dependent (predicted) variable is
    
b) looking at rows / observations to understand each column / feature

c) Look at the shape / size of the data: is it
    too large / too small (or just right)?
    
<font color='red'>
if too small: ask for or get more data

synthetic data generation if imbalanced data set

if too large: should we random sample? 
</font>

d) is data loaded correctly? Example: zip codes starting with a zero

e) do we have multiple datasets?
combine them / truncate them / modify them?

f) checking data type

g) are data on same scale: timezone / currency

h) missing values: how are they encoded? They are not always encoded as "NaN" or "None"; sometimes they appear as the string "missing", as "-999", etc.

i) Check for duplicates, e.g. duplicated rows of observations

j) is there a unique identifier column?

k) Check for inconsistencies, e.g. in a housing dataset, a house with 33 bedrooms is possible but appears unlikely

l) Check for irrelevant data: check the relevance of the data to the problem statement, e.g. a drug dataset contained a fictional drug that doesn't exist

m) Check for sub/additional datasets, e.g. when the complete project info has to be assembled from CSV files, JSON files, other datasets, web scraping, etc.
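
Several of the initial-look checks above can be sketched in a few lines of pandas. The toy data, the column names, and the sentinel values `"missing"` and `"-999"` are assumptions for illustration.

```python
import io

import pandas as pd

# Toy CSV with a leading-zero zip code, sentinel-encoded missing values,
# and one duplicated row (all values are made up).
raw = io.StringIO(
    "id,zip,price,status\n"
    "1,01234,100,ok\n"
    "2,98765,-999,missing\n"
    "2,98765,-999,missing\n"
)

# dtype=str keeps the leading zero of the zip code;
# na_values tells pandas which sentinels to treat as missing.
df = pd.read_csv(raw, dtype={"zip": str}, na_values=["missing", "-999"])

print(df.shape)         # size of the data
print(df.dtypes)        # data type per column
print(df.isna().sum())  # missing values per column

n_duplicates = int(df.duplicated().sum())  # fully duplicated rows
id_is_unique = df["id"].is_unique          # can "id" serve as a unique identifier?
print(n_duplicates, id_is_unique)
```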

a) Renaming: removing whitespace / unwanted characters

a1) Convert types, e.g. if a feature should be a date but is actually stored as a string, convert it accordingly

b) Handle missing data:
    impute, e.g. with the mean/median etc. of the column
    delete the feature
    delete observations
    insert a dummy for missing values, e.g. for a categorical variable, we can make an extra dummy column for missing values
    talk to the business
    
  For the house prices project, we saw multicollinear features with missing values, which told us that DELETING them was not the best option
    
c) Look for similar or same data
    too large --> 
    
    rows: take random sample (after cleaning & 
 
    columns/features: do PCA (as part of feature engineering)

d) Is data loaded correctly? Depending on the initial look, we correct the feature values:
    restore zip code values that make sense, e.g. substitute -9999 with the actual zip codes (with leading 0)
    
e) Find relationships between datasets & combine them

f) (i) E.g. apply the pandas to_datetime() function to convert a string to a timestamp
   (ii) str representation of dict --> convert str to dict
   (iii) Try to convert to the appropriate type, e.g. check whether it can be converted to a numerical variable 
   
g) Categorical data can be encoded differently, e.g. 0/1 or yes/no. Ensure that the same feature is encoded in the same manner throughout.

h 

i

j) Handling of duplicates is use case specific!
    Remove duplicates 
    For anomaly detection, we would keep duplicates as they can help identify patterns
    

k) Example of 33 bedrooms --> research it or make an assumption that the correct entry was supposed to be 3. 
Always articulate and comment in Notebook!

l) Please look 

m) Refer to f)
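
A few of the cleaning actions above, sketched on toy data. The column names and values are made up, and median imputation and dropping duplicates are only one of the options listed; which one fits depends on the use case.

```python
import numpy as np
import pandas as pd

# Toy data with messy column names, a date stored as a string,
# missing values, and one duplicated row.
df = pd.DataFrame({
    " Launched At ": ["2020-01-15", "2020-02-01", "2020-02-01"],
    "Goal ": [1000.0, np.nan, np.nan],
    "country": ["US", None, None],
})

# a) Renaming: strip whitespace, lowercase, replace inner spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# a1) / f) Convert the date string to a proper timestamp.
df["launched_at"] = pd.to_datetime(df["launched_at"])

# b) Impute a numerical column with its median.
df["goal"] = df["goal"].fillna(df["goal"].median())

# b) Give missing categorical values their own category,
#    which later becomes its own dummy column.
df["country"] = df["country"].fillna("missing")

# j) Remove duplicates (use case specific, see above).
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```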



## IV Data Exploration

a) Summary statistics (pandas.DataFrame.describe()) tell us the distributions of numerical features / variables

b) To support (a) above, visualize the numerical features (matplotlib, seaborn, bokeh, folium), plot distribution of y variable

c) Plot the distribution of categorical variables as bar plot or pie plot

d) Explore the target / predicted variable and its relationship with the predictor variables --> scatterplots
d1) Check for correlation, e.g. with a heatmap

e) Define distinct queries to get more granular insights on data (SQL)

f) Check outliers from distribution plots (or scatter plots)

g) Make more advanced plots (e.g. plot the success rate of female entrepreneurs by category) to dive deeper into the relationships of the various features among each other & with the predicted variable
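
The numbers behind these exploration steps can be sketched without any plotting. The toy data and column names are assumptions; the correlation matrix is what one would typically pass on to a heatmap, and the grouped success rate is what a bar plot per category would show.

```python
import pandas as pd

# Toy data with a binary target "success" (all values are made up).
df = pd.DataFrame({
    "category": ["art", "art", "games", "games", "games", "music"],
    "goal": [500, 2000, 10000, 1500, 3000, 800],
    "backers": [12, 3, 150, 40, 2, 25],
    "success": [1, 0, 1, 1, 0, 1],
})

# a) Summary statistics of the numerical features.
summary = df.describe()

# c) Counts per category (input for a bar plot).
category_counts = df["category"].value_counts()

# d1) Pairwise correlations, including the target (input for a heatmap).
corr = df[["goal", "backers", "success"]].corr()

# g) Success rate per category.
success_rate = df.groupby("category")["success"].mean().sort_values(ascending=False)

print(summary)
print(category_counts)
print(corr)
print(success_rate)
```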

## V. Feature Engineering

a) Transforming skewed continuous variables to more normally distributed variables (log(1+x)-transform). 

b) Extract new features from features, say from name feature extract gender
b1) Create new features by combining existing features, e.g. from total sales and number of purchases calculate average size of purchase.

c) Create new categories by either:
    binning numerical variables into categories
    regrouping old categories into new categories
    
d) Scaling data
    
e) Create dummy variables for variables we want to use

f) Dimensionality reduction, e.g. PCA or KMeans clustering

g) Impute values for features (* this can be a part of data cleaning as well as feature engineering)

h) Assign weights to features, denoting feature **importance**

i) Remove undesired features, drop them
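
Some of these transformations, sketched with pandas and NumPy. The column names, bin edges, and labels are made up for illustration. The scaling is written out by hand here; a library scaler would do the same job.

```python
import numpy as np
import pandas as pd

# Toy data (all values are made up).
df = pd.DataFrame({
    "goal": [100.0, 1000.0, 10000.0, 1000000.0],
    "total_sales": [200.0, 90.0, 40.0, 500.0],
    "n_purchases": [4, 3, 1, 10],
    "category": ["art", "games", "art", "music"],
})

# a) log(1+x) transform of a skewed variable.
df["goal_log"] = np.log1p(df["goal"])

# b1) Combine existing features: average size of a purchase.
df["avg_purchase"] = df["total_sales"] / df["n_purchases"]

# c) Bin a numerical variable into categories (bin edges are arbitrary).
df["goal_bin"] = pd.cut(
    df["goal"], bins=[0, 500, 50000, np.inf], labels=["small", "medium", "large"]
)

# d) Scale to zero mean and unit variance.
df["goal_log_scaled"] = (df["goal_log"] - df["goal_log"].mean()) / df["goal_log"].std(ddof=0)

# e) Dummy variables for a categorical feature.
df = pd.get_dummies(df, columns=["category"])

print(df.head())
```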


## VI Predictive Modeling

a) Pick models that are appropriate for the current challenge + resources identified in "business understanding"
e.g. in Kaggle competitions, winners use stacking to get the best performance / evaluation results. In practice, this is usually not the case.

b) train test split (90-10, 70-30, 80-20)

c) What is the baseline model? E.g. guessing in the case of a binary classification problem. 
Based on the predetermined metric of success from <a href='#BU'>B.U.</a>, determine which evaluation metric to use.

e) Run the model
        - K-fold cross validation
        - Grid-Search

f) Evaluate model performance:
        - confusion matrix
        - classification report

g) Compare models (ideally at least against evaluation metrics and baseline model)

h) Identify the most important features across all models.

i) * Hypothesis testing for marketing / sales / healthcare domains

j) Visualize the model output based on the most important features, e.g. model predicted probabilities against actual data.
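
The modeling steps above can be sketched end to end with scikit-learn. The synthetic data, the random forest, the parameter grid, and the F1 scoring are assumptions chosen for illustration; the actual choices should follow from the metric of success defined in the business understanding.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data as a stand-in for the real features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# b) 80-20 train test split, stratified to keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# c) Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# e) Grid search with 5-fold cross validation on the training data.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# f) Evaluate the best model on the held-out test set.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
model_acc = accuracy_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# g) Compare against the baseline.
print(f"baseline accuracy: {baseline_acc:.2f}, model accuracy: {model_acc:.2f}")

# h) Feature importances of the best model.
importances = search.best_estimator_.feature_importances_
print(importances)
```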

## VII Data Visualization

(ii) Create dummies for categorical variables