<a href="https://colab.research.google.com/github/yagnik99/Funds-Prediction-for-a-Startup/blob/main/Yagnik_Pandya_Capstone_Project_4_Startup_Funding_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Predict whether a startup will get funded in the next three months. </u></b>

## <b> Problem Description </b>

### Investment in early-stage startups has grown sharply over the last five years, and large VC firms are increasingly interested in the startup funding space. The task is to predict whether a startup will receive funding in the next three months, using app traction data and startup details.

### This funding can be a seed round, Series A, Series B, and so on.


## <b> Data Description </b>

### The file contains monthly data for each startup entity, covering several app traction metrics.

### <b> In the given data, you have the following variables: </b>
* ### UUID - Unique Identifier for a single startup entity
* ### Month - Month for which the app data is available
* ### Application category - The category to which an application belongs
* ### Avg Session Time - Average session time in the app during the month
* ### Total Session Time - Average session time per user × open rate
* ### Open_rate - Number of times the app has been opened by a user
* ### Reach - % of devices having the app installed
* ### Funding_ind - Indicator for a funded startup

## <b>The following variables are available only for funded startups:</b>
* ### Business models - The business model of the startup
* ### City - The city where the startup is based
* ### Company Stage - The stage of the company
* ### Feed name - 
* ### Founded year - The year in which the startup was founded
* ### Latest funded date - The most recent date on which the startup was funded
* ### MAU - % of reach that opened the app in the given month (Monthly Active Users)
* ### Overview - Overview of the startup
* ### Practice Areas - 
* ### Region - Region where the startup operates
* ### Total Funding - Total amount of funding up to the given month
* ### Uninstall Rate - Rate at which the application is uninstalled



In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
folder = '/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 4/Week 4/Capstone Project 4/data/'

In [4]:
O_data = pd.read_csv(folder + 'data_driven_investment_1.csv')

In [33]:
O_data.head(3)

Unnamed: 0,UUID,month,Application Category,Avg_Session_Time,Business Models,City,Company Stage,Feed Name,Founded Year,Latest Funded Date,MAU,Open_Rate,Overview,Practice Areas,Reach,Region,Total Funding (USD),Total_Session_Time,Uninstall_Rate,funding_ind
0,c1ad38e2d357610c129657d870ede902e8abfcb9,20180801,Books & Reference,,,,,,,,,,,,0.042963,National,,,0.024443,0
1,d841e670d9191f896a0cbc75961920887b375756,20180801,Books & Reference,,,,,,,,,,,,0.003244,National,,,0.009828,0
2,0a59d138b3eaccd22b665eae70c756fef83ddb63,20180801,Finance,,,,,,,,,,,,0.000161,National,,,0.235294,0


In [6]:
# Shape of Original Dataset
O_data.shape

(1502175, 20)

In [7]:
O_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1502175 entries, 0 to 1502174
Data columns (total 20 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   UUID                  1500693 non-null  object 
 1   month                 1502175 non-null  int64  
 2   Application Category  1502175 non-null  object 
 3   Avg_Session_Time      877732 non-null   float64
 4   Business Models       2414 non-null     object 
 5   City                  2407 non-null     object 
 6   Company Stage         2414 non-null     object 
 7   Feed Name             2414 non-null     object 
 8   Founded Year          2414 non-null     float64
 9   Latest Funded Date    2414 non-null     object 
 10  MAU                   118087 non-null   float64
 11  Open_Rate             877715 non-null   float64
 12  Overview              2414 non-null     object 
 13  Practice Areas        2414 non-null     object 
 14  Reach                 1050875 non-

In [8]:
# Count of funded (1) vs. non-funded (0) rows
O_data['funding_ind'].value_counts()

0    1499761
1       2414
Name: funding_ind, dtype: int64
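
The counts above show a heavily imbalanced target. As a minimal sketch (not part of the original analysis), the imbalance can be quantified directly from those two numbers; the variable names `share` and `imbalance_ratio` are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 1_499_761, 1: 2_414}, name="funding_ind")

# Share of each class among all rows.
share = counts / counts.sum()

# Number of non-funded rows for every funded row.
imbalance_ratio = counts[0] / counts[1]

print(share.round(4))
print(f"non-funded rows per funded row: {imbalance_ratio:.0f}")
```

Funded rows make up roughly 0.16% of the data, so plain accuracy would likely be a misleading metric for any model trained on it.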

In [9]:
O_data.isna().sum()

UUID                       1482
month                         0
Application Category          0
Avg_Session_Time         624443
Business Models         1499761
City                    1499768
Company Stage           1499761
Feed Name               1499761
Founded Year            1499761
Latest Funded Date      1499761
MAU                     1384088
Open_Rate                624460
Overview                1499761
Practice Areas          1499761
Reach                    451300
Region                        0
Total Funding (USD)     1499920
Total_Session_Time       624510
Uninstall_Rate          1149906
funding_ind                   0
dtype: int64
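
Raw null counts are hard to compare at a glance. One possible way to read them, sketched here on a small made-up frame standing in for `O_data`, is to express them as a fraction of rows with `isna().mean()`. The `toy` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for O_data (all values are made up).
toy = pd.DataFrame({
    "UUID": ["a", "b", None, "d"],
    "Reach": [0.04, np.nan, 0.01, np.nan],
    "City": [None, None, None, "X"],
    "funding_ind": [0, 0, 0, 1],
})

# isna() yields booleans; their column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `O_data` would rank the columns from most to least sparse.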

In [22]:
# Checking for duplicates
O_data.duplicated().sum()

89412

In [23]:
# Dropping duplicates
O_data.drop_duplicates(inplace=True)
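
The row counts reported in this notebook are consistent with this step: 1,502,175 rows minus 89,412 duplicates leaves 1,412,763. A small sketch of the same bookkeeping on made-up data follows. It uses `ignore_index=True` (available in pandas 1.0 and later) to renumber the index, which the original cell does not do.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up).
toy = pd.DataFrame({
    "UUID": ["a", "a", "b"],
    "month": [20180801, 20180801, 20180801],
})

# Count fully duplicated rows (every column equal to an earlier row).
n_dupes = int(toy.duplicated().sum())

# Drop them and renumber the index from 0.
deduped = toy.drop_duplicates(ignore_index=True)

# Sanity check: rows removed must equal duplicates counted.
assert len(deduped) == len(toy) - n_dupes
print(n_dupes, len(deduped))
```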

In [25]:
# Number of rows where UUID is null
O_data['UUID'].isna().sum()

969

In [28]:
# Duration of Data
print(O_data['month'].min(),'- start')
print(O_data['month'].max(), '- end')

20180701 - start
20200601 - end
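
The `month` column is stored as an integer, which appears to follow a `YYYYMMDD` pattern. A minimal sketch, assuming that format, of parsing it into dates and counting the months covered:

```python
import pandas as pd

# The two endpoints printed above, in the same integer form.
months = pd.Series([20180701, 20200601], name="month")

# Parse the integers as dates; the format string is an assumption
# based on the values shown in the output.
parsed = pd.to_datetime(months.astype(str), format="%Y%m%d")

# Number of calendar months covered, counting both endpoints.
start, end = parsed.min(), parsed.max()
n_months = (end.year - start.year) * 12 + (end.month - start.month) + 1
print(start.date(), end.date(), n_months)
```

Under that assumption the data spans 24 calendar months, from July 2018 through June 2020.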


In [35]:
O_data[O_data['month'] == 20191201]

Unnamed: 0,UUID,month,Application Category,Avg_Session_Time,Business Models,City,Company Stage,Feed Name,Founded Year,Latest Funded Date,MAU,Open_Rate,Overview,Practice Areas,Reach,Region,Total Funding (USD),Total_Session_Time,Uninstall_Rate,funding_ind
29180,c1ad38e2d357610c129657d870ede902e8abfcb9,20191201,Books & Reference,0.948620,,,,,,,0.645337,2.892748,,,0.020573,National,,2.747994,0.027411,0
29181,d841e670d9191f896a0cbc75961920887b375756,20191201,Books & Reference,1.285343,,,,,,,0.470199,3.580697,,,0.004869,National,,4.121214,0.017159,0
29182,0a59d138b3eaccd22b665eae70c756fef83ddb63,20191201,Finance,2.742948,,,,,,,0.111111,1.914286,,,0.000082,National,,7.232086,0.108911,0
29183,0a59d138b3eaccd22b665eae70c756fef83ddb63,20191201,Books & Reference,1.040757,,,,,,,0.117647,22.712329,,,0.000030,National,,22.515724,0.030303,0
29184,df5fb9891f77df24a91d039f1817c6c4e79244ef,20191201,Productivity,0.531213,,,,,,,0.517625,3.305187,,,0.001717,National,,1.768690,0.286079,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125699,e0c97561f3f8d175d623edf201deb329542e7f15,20191201,Education,,,,,,,,,1.000000,,,,National,,93.567800,,0
125700,758e1f38f006ac569cc92f4307b41af580c14df5,20191201,Puzzle,,,,,,,,,1.000000,,,,National,,102.225567,,0
125701,678f2b5e85cc409cf27aa93deece6bae47fedb0b,20191201,Casual,,,,,,,,,2.000000,,,,National,,136.455800,,0
125702,14751de39f57ee3b6a8735c9bff2ff06a8bff8d1,20191201,Lifestyle,,,,,,,,,1.000000,,,,National,,79.575591,,0
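
The data description defines Total Session Time as average session time multiplied by open rate. As a rough check (a sketch, not part of the original analysis), the first three rows of the output above can be compared against that definition. The column names `implied_total` and `rel_gap` are introduced here for illustration.

```python
import pandas as pd

# First three rows of the December 2019 output above.
rows = pd.DataFrame({
    "Avg_Session_Time": [0.948620, 1.285343, 2.742948],
    "Open_Rate": [2.892748, 3.580697, 1.914286],
    "Total_Session_Time": [2.747994, 4.121214, 7.232086],
})

# Product implied by the data description.
rows["implied_total"] = rows["Avg_Session_Time"] * rows["Open_Rate"]

# Relative gap between the recorded total and the implied product.
rows["rel_gap"] = (
    (rows["Total_Session_Time"] - rows["implied_total"]).abs()
    / rows["Total_Session_Time"]
)
print(rows.round(3))
```

On these rows the product is close to the recorded value for the first row and noticeably off for the other two, so the relation looks approximate rather than exact in this sample.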


In [36]:
# Copy of Data for future use
data = O_data.copy()

In [39]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1412763 entries, 0 to 1502174
Data columns (total 20 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   UUID                  1411794 non-null  object 
 1   month                 1412763 non-null  int64  
 2   Application Category  1412763 non-null  object 
 3   Avg_Session_Time      877732 non-null   float64
 4   Business Models       2304 non-null     object 
 5   City                  2297 non-null     object 
 6   Company Stage         2304 non-null     object 
 7   Feed Name             2304 non-null     object 
 8   Founded Year          2304 non-null     float64
 9   Latest Funded Date    2304 non-null     object 
 10  MAU                   118087 non-null   float64
 11  Open_Rate             877715 non-null   float64
 12  Overview              2304 non-null     object 
 13  Practice Areas        2304 non-null     object 
 14  Reach                 961463 non-n

In [40]:
# Features that are available only for funded startups
Funded_features = ['Business Models', 'City', 'Company Stage', 'Feed Name', 'Founded Year', 'Latest Funded Date', 'MAU', 'Overview', 'Practice Areas', 'Total Funding (USD)', 'Uninstall_Rate']

In [41]:
data.drop(columns=Funded_features, inplace=True)

In [45]:
data.shape

(1412763, 9)
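
The list of dropped features is written by hand. The December 2019 output earlier in the notebook shows `MAU` and `Uninstall_Rate` values on rows where `funding_ind` is 0, so a data-driven cross-check may be worthwhile. The sketch below, on made-up data, flags a column as funded-only when it has no values at all among non-funded rows. That criterion is an assumption, not the author's method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the original data (values are made up).
toy = pd.DataFrame({
    "City": [None, None, "X"],
    "MAU": [0.6, np.nan, 0.4],
    "Reach": [0.02, 0.01, 0.03],
    "funding_ind": [0, 0, 1],
})

non_funded = toy[toy["funding_ind"] == 0]

# A column counts as funded-only if every non-funded row is missing it.
funded_only = [
    col for col in toy.columns
    if col != "funding_ind" and non_funded[col].isna().all()
]
print(funded_only)
```

Here only `City` qualifies, because `MAU` has a value on a non-funded row.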

In [37]:
Non_funded = data[data['funding_ind'] == 0]

In [38]:
Funded = data[data['funding_ind'] == 1]
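
Instead of keeping two separate frames, the groups can also be summarised side by side with `groupby`. A minimal sketch on made-up data, where the choice of `Reach` and the output column names are illustrative assumptions:

```python
import pandas as pd

# Toy frame standing in for `data` (values are made up).
toy = pd.DataFrame({
    "UUID": ["a", "a", "b", "c"],
    "Reach": [0.02, 0.04, 0.01, 0.30],
    "funding_ind": [0, 0, 0, 1],
})

# One row per class: number of rows, distinct startups, and median reach.
summary = toy.groupby("funding_ind").agg(
    rows=("UUID", "size"),
    startups=("UUID", "nunique"),
    median_reach=("Reach", "median"),
)
print(summary)
```

Counting distinct `UUID` values matters here because each startup contributes one row per month, so row counts alone overstate the number of entities.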