## Problem Statement ##
In this notebook we are going to predict the "final_status" of the projects given in the test data, which will help us identify whether a project will be successfully funded or not.

In [None]:
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import re
import datetime
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn import preprocessing

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
train=pd.read_csv("../input/train.csv")
test=pd.read_csv("../input/test.csv")

In [None]:
pd.set_option('display.max_colwidth',100)

In [None]:
train.head()

In [None]:
# A notebook cell only displays its last expression, so print both counts explicitly
print(train.isnull().sum())
print(test.isnull().sum())

Let's fill the missing values (NaNs) in the text columns with " "

In [None]:
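# fillna returns a new Series, so assign the result back to the column
train['name']=train['name'].fillna(" ")
train['desc']=train['desc'].fillna(" ")
test['desc']=test['desc'].fillna(" ")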
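One thing to keep in mind here: `fillna` returns a new Series and leaves the original column untouched unless the result is assigned back. Below is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset) that fills the text columns and then checks that no NaNs remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train/test; not from the real dataset.
toy = pd.DataFrame({"name": ["proj a", np.nan], "desc": [np.nan, "some text"]})

# fillna returns a new Series, so assign it back to keep the change.
for col in ["name", "desc"]:
    toy[col] = toy[col].fillna(" ")

# Count the NaNs left in the two text columns.
remaining_nulls = int(toy[["name", "desc"]].isnull().sum().sum())
print(remaining_nulls)
```

Running the same null count on `train` and `test` after filling is a quick way to confirm the step worked.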

The columns 'state_changed_at', 'created_at', 'launched_at' and 'deadline' are in UNIX time format. Let's convert them into a standard date-time format.

In [None]:
col_date=['state_changed_at','created_at','launched_at','deadline']

for i in col_date:
    train[i]=train[i].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime("%Y-%m-%d %H:%M:%S"))
    
for i in col_date:
    test[i]=test[i].apply(lambda x: datetime.datetime.fromtimestamp(int(x)).strftime("%Y-%m-%d %H:%M:%S"))
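As an alternative sketch, pandas can convert a whole column of UNIX timestamps in one call with `pd.to_datetime(..., unit="s")`. Two differences from the loop above are worth noting: the result is already a datetime column rather than a formatted string, and the values are interpreted as UTC, whereas `datetime.fromtimestamp` uses the local timezone of the machine. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with UNIX timestamps in seconds.
toy = pd.DataFrame({"created_at": [0, 86400], "deadline": [3600, 90000]})

for col in ["created_at", "deadline"]:
    # unit="s" interprets the integers as seconds since the epoch (UTC).
    toy[col] = pd.to_datetime(toy[col], unit="s")

print(toy.dtypes)
```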

## Exploratory Analysis ##

In [None]:
sns.countplot(x='final_status',data=train)

Only 30% of the projects are successfully funded.
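The class share can also be read off numerically instead of from the plot. A small sketch, assuming `final_status` is coded as 0/1; the `toy` frame is made up so that 3 of 10 rows are positive, and the same `value_counts(normalize=True)` call can be run on `train`.

```python
import pandas as pd

# Hypothetical toy labels: 3 funded (1) and 7 not funded (0).
toy = pd.DataFrame({"final_status": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# normalize=True returns proportions instead of raw counts.
share = toy["final_status"].value_counts(normalize=True)
print(share)
```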

In [None]:
sns.countplot(x='disable_communication',data=train, hue='final_status')

In [None]:
train['disable_communication'].value_counts()

In [None]:
goal_1=train['goal'][train['final_status']==1]
goal_0=train['goal'][train['final_status']==0]

print("Average Goal value when Funding is approved: ",goal_1.mean())
print("Average Goal value when Funding is not approved: ",goal_0.mean())

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x='final_status', y='goal',data=train, hue='disable_communication',kind='bar')

The mean goal is high for successfully funded projects.
Also, every successfully funded project has "disable_communication" set to False.
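Both observations can be checked with a table rather than a bar plot. The sketch below uses a made-up `toy` frame (its numbers say nothing about the real data): `groupby` gives the mean goal per outcome, and `pd.crosstab` counts projects for each combination of `disable_communication` and `final_status`.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "goal": [6000.0, 2000.0, 7000.0, 1000.0],
    "final_status": [1, 0, 1, 0],
    "disable_communication": [False, True, False, False],
})

# Mean goal for each value of final_status.
mean_goal = toy.groupby("final_status")["goal"].mean()

# Rows: disable_communication, columns: final_status, cells: project counts.
counts = pd.crosstab(toy["disable_communication"], toy["final_status"])

print(mean_goal)
print(counts)
```

A zero in the row for `True` under column `1` would confirm that no funded project had communication disabled.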

In [None]:
sns.countplot(x='country',data=train,hue='final_status')

In [None]:
sns.countplot(x='currency',data=train,hue='final_status')

Most of the applications for funding come from the US, GB, Canada and Australia.

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x='final_status',data=train,y='backers_count',kind='bar')

"backers_count" is higher for successfully funded projects.

## Feature Engineering ##

Convert the date columns from strings into datetime objects so that we can do arithmetic on them.

In [16]:
col_date=['state_changed_at','created_at','launched_at','deadline']

for i in col_date:
    train[i]=train[i].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
    
for i in col_date:
    test[i]=test[i].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))

Create a new feature **'launched_before_deadline'** which tells you how long before the deadline (in seconds) the project was launched.

In [None]:
d1=[]
d2=[]
for i in range(len(train['deadline'])):
        d1.append((train['deadline'].iloc[i]-train['launched_at'].iloc[i]).total_seconds())
        
for i in range(len(test['deadline'])):
        d2.append((test['deadline'].iloc[i]-test['launched_at'].iloc[i]).total_seconds())
        
        
train['launched_before_deadline']=d1
test['launched_before_deadline']=d2
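The same difference can be computed without an explicit loop. A minimal sketch, assuming both columns already hold datetime values: subtracting two datetime columns gives a timedelta column, and `.dt.total_seconds()` converts it to seconds. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with datetime columns.
toy = pd.DataFrame({
    "launched_at": pd.to_datetime(["2015-01-01 00:00:00", "2015-02-01 12:00:00"]),
    "deadline": pd.to_datetime(["2015-01-31 00:00:00", "2015-02-02 12:00:00"]),
})

# Column-wise subtraction yields timedeltas; convert them to seconds.
toy["launched_before_deadline"] = (
    toy["deadline"] - toy["launched_at"]
).dt.total_seconds()

print(toy["launched_before_deadline"].tolist())
```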

Create a new feature **'created_before_deadline'** which tells you how long before the deadline (in seconds) the project was created.

In [None]:
d3=[]
d4=[]
for i in range(len(train['deadline'])):
        d3.append((train['deadline'].iloc[i]-train['created_at'].iloc[i]).total_seconds())
        
for i in range(len(test['deadline'])):
        d4.append((test['deadline'].iloc[i]-test['created_at'].iloc[i]).total_seconds())
        
        
train['created_before_deadline']=d3
test['created_before_deadline']=d4

Create a new feature **'state_changed_after_deadline'** which tells you how long after the deadline (in seconds) the state of the project changed.

In [None]:
d5=[]
d6=[]
for i in range(len(train['deadline'])):
        d5.append((train['state_changed_at'].iloc[i]-train['deadline'].iloc[i]).total_seconds())
        
for i in range(len(test['deadline'])):
        d6.append((test['state_changed_at'].iloc[i]-test['deadline'].iloc[i]).total_seconds())
        
        
train['state_changed_after_deadline']=d5
test['state_changed_after_deadline']=d6
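Since the three features are built the same way for `train` and `test`, the steps above could be gathered into one helper so that the column names cannot drift apart between the two frames. This is only a sketch: `add_deadline_features` is a hypothetical name introduced here, it assumes the four date columns are already datetime values, and the `toy` frame is made up.

```python
import pandas as pd

def add_deadline_features(df):
    """Return a copy of df with the three deadline-relative features (in seconds)."""
    out = df.copy()
    out["launched_before_deadline"] = (
        out["deadline"] - out["launched_at"]
    ).dt.total_seconds()
    out["created_before_deadline"] = (
        out["deadline"] - out["created_at"]
    ).dt.total_seconds()
    out["state_changed_after_deadline"] = (
        out["state_changed_at"] - out["deadline"]
    ).dt.total_seconds()
    return out

# Hypothetical one-row frame to exercise the helper.
toy = pd.DataFrame({
    "created_at": pd.to_datetime(["2015-01-01 00:00:00"]),
    "launched_at": pd.to_datetime(["2015-01-02 00:00:00"]),
    "deadline": pd.to_datetime(["2015-01-12 00:00:00"]),
    "state_changed_at": pd.to_datetime(["2015-01-12 00:00:05"]),
})
toy_features = add_deadline_features(toy)
print(toy_features.iloc[0])
```

The helper would then be called once per frame, for example `train = add_deadline_features(train)` and `test = add_deadline_features(test)`.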

In [None]:
train.head()