## <font color='#607c8e'>Startup's Success Analysis</font>
<font color='#cb416b'>Data Science Foundation Program BootCamp<br/></font>
Raquel Câmara Porto

<b>Objective: Predict the profit of a new startup from its spending (R&D, administration, marketing) and its state, and use that prediction to decide whether the startup is worth investing in.</b>

### <font color='#3c4142'>Exploratory Data Analysis</font>

In [2]:
# Importing libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [3]:
# Importing dataset
data = pd.read_csv('50_Startups.csv')
data.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
# Checking for missing values and column types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


The output shows that no column has missing values (all 50 entries are non-null). From it we can also identify the <b>features</b> (inputs) and the <b>target</b> (output), as well as the type of data in each column:

    1. Target:   Profit ------------ Numerical
    
    2. Features: R&D Spend --------- Numerical
                 Administration ---- Numerical
                 Marketing Spend --- Numerical
                 State ------------- Categorical
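
The same split into numerical and categorical columns can also be derived programmatically instead of being read off the `info()` output. The snippet below is a minimal sketch: `sample` is a stand-in frame built from the five rows shown by `data.head()`, and the names `numerical_cols` and `categorical_cols` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Stand-in for the CSV: the five rows printed by data.head() above (assumption:
# the full dataset has the same columns and dtypes)
sample = pd.DataFrame({
    'R&D Spend': [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    'Administration': [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    'Marketing Spend': [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    'State': ['New York', 'California', 'Florida', 'New York', 'Florida'],
    'Profit': [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Columns with a numeric dtype vs. everything else
numerical_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

print('Numerical:  ', numerical_cols)
print('Categorical:', categorical_cols)
```

On the real dataset the same two calls could be run on `data` directly.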
                 
Next, we look at basic descriptive statistics for both the numerical and the categorical columns.

In [5]:
# Generating descriptive statistics for numerical data
data.describe()

|       | R&D Spend | Administration | Marketing Spend | Profit |
|-------|-----------|----------------|-----------------|--------|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


In [6]:
# Generating descriptive statistics for categorical data
data.describe(include=[object])

|        | State |
|--------|-------|
| count | 50 |
| unique | 3 |
| top | California |
| freq | 17 |


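With this information, we can summarize the central tendency, dispersion and shape of each numerical column's distribution. We can also see that State takes 3 distinct values, the most frequent being California (17 of the 50 rows).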
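
To make "central tendency, dispersion and shape" concrete, the sketch below computes one measure of each for a single column. It is only an illustration: the series holds the five `Profit` values shown by `data.head()`, so the numbers it prints describe those five rows and not the full dataset, and the name `summary` is hypothetical.

```python
import pandas as pd

# The five Profit values printed by data.head() (not the full 50-row column)
profit = pd.Series(
    [192261.83, 191792.06, 191050.39, 182901.99, 166187.94], name='Profit'
)

summary = pd.Series({
    'mean': profit.mean(),      # central tendency
    'median': profit.median(),  # central tendency, robust to outliers
    'std': profit.std(),        # dispersion (sample standard deviation)
    'iqr': profit.quantile(0.75) - profit.quantile(0.25),  # dispersion
    'skew': profit.skew(),      # shape: negative means a longer lower tail
})
print(summary)
```

Note that `describe()` itself does not report skewness; shape can only be inferred from it indirectly, for example by comparing the mean with the median or the distances between quartiles.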

In [7]:
# Listing the distinct values of the categorical column
data['State'].unique()

array(['New York', 'California', 'Florida'], dtype=object)

Now we can separate our dataset into features and label (target).

In [8]:
# Separating data: every column except the last is a feature; the last column (Profit) is the target
features = data.iloc[:, :-1]
target = data.iloc[:, -1]

In [9]:
# Checking features columns
features.head()

|   | R&D Spend | Administration | Marketing Spend | State |
|---|-----------|----------------|-----------------|-------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |


In [10]:
# Checking target column
target.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64
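
The positional split with `iloc` relies on `Profit` being the last column. As a hedged alternative, the sketch below selects by column name, which would keep working if the column order changed. It uses a stand-in frame built from the five rows shown by `data.head()`; the `*_by_name` and `*_by_pos` variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the CSV: the five rows printed by data.head()
sample = pd.DataFrame({
    'R&D Spend': [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    'Administration': [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    'Marketing Spend': [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    'State': ['New York', 'California', 'Florida', 'New York', 'Florida'],
    'Profit': [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Name-based split: does not depend on where Profit sits in the frame
features_by_name = sample.drop(columns='Profit')
target_by_name = sample['Profit']

# Positional split, as used in the notebook
features_by_pos = sample.iloc[:, :-1]
target_by_pos = sample.iloc[:, -1]

# Both approaches give the same result for this column order
print(features_by_name.equals(features_by_pos))
print(target_by_name.equals(target_by_pos))
```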

With the Exploratory Data Analysis done, we can start preparing the data for use in our models.