In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In this notebook, we'll analyze National Football League (NFL) play-by-play data. The dataset covers all regular season games from 2009-2016 and contains 356,768 rows and 100 columns.
Let's get started!

In [1]:
# Libraries we will need:
import pandas as pd
import numpy as np

In [1]:
nfl_play = pd.read_csv('../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv')

In [1]:
np.random.seed(0)
# Now, let's take a look at a sample of the data
nfl_play.sample(10)

Above, we can clearly see that the dataset has some missing values. Let's find them.

In [1]:
mis_values = nfl_play.isnull().sum()
mis_values 

A fairly large share of the values appears to be missing. Let's quantify how large.

In [1]:
# total number of cells and total number of missing cells
tot_cells = np.prod(nfl_play.shape)
tot_mis = mis_values.sum()

# percentage of cells that are missing
(tot_mis/tot_cells) * 100

As expected, about 24.9% of the cells in the dataset are missing.
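
The overall percentage hides which columns are responsible for the gaps. A minimal sketch of a per-column breakdown follows; the `toy` frame and its values are made up for illustration and only stand in for `nfl_play`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for nfl_play (values are made up)
toy = pd.DataFrame({
    "down": [1.0, 2.0, np.nan, 4.0],
    "yards": [5.0, np.nan, np.nan, 3.0],
    "team": ["A", "B", "C", "D"],
})

# Percentage of missing cells in each column
missing_pct_by_col = toy.isnull().mean() * 100

# Percentage of missing cells across the whole frame
overall_missing_pct = toy.isnull().sum().sum() / toy.size * 100

print(missing_pct_by_col.sort_values(ascending=False))
print(overall_missing_pct)
```

Sorting the per-column result makes it easy to spot the columns that are worth a closer look before deciding how to handle them.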

# Dropping Missing Values

As a rule of thumb, a variable with more than about 30% missing values becomes a candidate for dropping from the analysis, but in practice the cutoff varies. Sometimes a variable with ~50% missing values still has to be kept because the customer insists on having it in the analysis. Once 60–70 percent of a variable is missing, dropping it should be seriously considered.
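
A cutoff like the one discussed above can be applied directly to the per-column share of missing values. The sketch below uses a made-up `toy` frame, and the 0.6 cutoff is an assumption chosen for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 25% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

# Assumed cutoff: drop columns where more than 60% of the values are missing
max_missing_share = 0.6

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Keep only the columns at or below the cutoff
kept = toy.loc[:, missing_share <= max_missing_share]

print(list(kept.columns))
```

Unlike a plain `dropna(axis=1)`, this keeps columns that have only a few gaps, which can then be filled in instead of discarded.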



In [1]:
#Let's drop rows with at least one missing value.
nfl_play.dropna()

Now, let's do the same with columns. We'll remove those that have at least one missing value.

In [1]:
col_with_na_dropped = nfl_play.dropna(axis=1)
col_with_na_dropped.head()


In [1]:
# let's see how much data we lost at this point
print("Columns in original dataset: %d \n" % nfl_play.shape[1])
print("Columns with NAs removed: %d" % col_with_na_dropped.shape[1])

Every column that contained a NaN has been removed from the data.
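
To make the difference between the row-wise and column-wise drops concrete, here is a small sketch on a made-up `toy` frame. It also shows the `subset` argument of `dropna`, which is not used elsewhere in this notebook and is included only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

rows_dropped = toy.dropna()             # keep only rows without any NaN
cols_dropped = toy.dropna(axis=1)       # keep only columns without any NaN
rows_subset = toy.dropna(subset=["a"])  # only require column "a" to be present

print(rows_dropped.shape, cols_dropped.shape, rows_subset.shape)
print(toy.shape)  # dropna returns a new frame; the original is unchanged
```

Because `dropna` returns a new frame, the result has to be assigned to a variable if it is needed later.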

# Filling in Missing Values 
Now, instead of removing data, we'll try filling in the missing values.


In [1]:
subset_nfl = nfl_play.loc[:, 'EPA':'Season'].head()
subset_nfl

Time to replace the NaNs with some value. In our case, we'll replace them with 0.

In [1]:
# replace all NaNs with 0
subset_nfl.fillna(0)

In [1]:
# replace each NA with the value that comes directly after it in the same column, then replace any remaining NAs with 0
subset_nfl.bfill(axis=0).fillna(0)
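
The fill strategies used above behave differently at the edges of a column. The following sketch compares them on a made-up single-column frame; the forward fill (`ffill`) is not used in this notebook and is shown only for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with gaps (values are made up)
s = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan]})

# Replace every NaN with 0
zero_filled = s.fillna(0)

# Take the next valid value below; the trailing NaN has none, so it falls back to 0
back_filled = s.bfill().fillna(0)

# Take the last valid value above (shown for contrast only)
forward_filled = s.ffill()

print(zero_filled["x"].tolist())
print(back_filled["x"].tolist())
print(forward_filled["x"].tolist())
```

A backward or forward fill only makes sense when the row order carries meaning, for example when rows are consecutive plays in a game.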

# Normalization and Scaling

Scaling changes the range of the independent variables (features) of the data, for example to 0-1, without changing the shape of their distribution. In data processing, it is sometimes also called data normalization. We use it during the data preprocessing step.
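
As a minimal sketch of what min-max scaling does, the arithmetic can be written by hand with NumPy. The helper name `min_max_scale` is hypothetical and not part of any library; it is meant only to show the formula `(x - min) / (max - min)`.

```python
import numpy as np

def min_max_scale(values):
    """Linearly rescale a 1-D array so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: avoid dividing by zero and return all zeros
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

scaled = min_max_scale([10, 20, 40])
print(scaled)
```

The relative spacing of the values is preserved; only the range they occupy changes.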

In [1]:
# Libraries we'll use:

import pandas as pd
import numpy as np

# for the Box-Cox transformation
from scipy import stats

# for min-max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)

In [1]:
# we'll generate 1000 data points randomly drawn from an exponential distribution
orig_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(orig_data, columns=[0])

# plot both together to compare
# (histplot with kde=True replaces the deprecated distplot)
fig, ax = plt.subplots(1,2)
sns.histplot(orig_data, ax=ax[0], kde=True, stat="density", legend=False)
ax[0].set_title("Original data")
sns.histplot(scaled_data, ax=ax[1], kde=True, stat="density", legend=False)
ax[1].set_title("Scaled data")

The goal of normalization is to bring the values of numeric columns in the dataset onto a common scale without distorting the differences in their ranges. Not every dataset needs normalization for machine learning; it is only required when features have different ranges. Here, we normalize with a Box-Cox transformation, which also changes the shape of the distribution so that it looks more like a normal distribution.

In [1]:
# We'll normalize the exponential data with Box-Cox
# stats.boxcox returns a tuple: (transformed data, fitted lambda)
normal_data = stats.boxcox(orig_data)

# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(orig_data, ax=ax[0], kde=True, stat="density", legend=False)
ax[0].set_title("Original data")
sns.histplot(normal_data[0], ax=ax[1], kde=True, stat="density", legend=False)
ax[1].set_title("Normalized data")

We can clearly see that the shape of our data has changed: before normalizing it was almost L-shaped, and after normalizing it looks more like the outline of a bell curve.
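
The change in shape can also be checked numerically instead of by eye. The sketch below compares the skewness of an exponential sample before and after Box-Cox. It draws its own sample with `np.random.default_rng`, so the numbers will not match the plots above exactly, and using skewness as the measure is an assumption made for this illustration.

```python
import numpy as np
from scipy import stats

# Draw a strictly positive sample; Box-Cox requires positive input
rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

# boxcox returns the transformed data and the lambda it fitted by maximum likelihood
transformed, fitted_lambda = stats.boxcox(sample)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)

print(skew_before, skew_after, fitted_lambda)
```

A skewness much closer to zero after the transformation is consistent with the move from an L-shaped to a bell-shaped histogram.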