## The Data Analysis Process

### Assess Data by Asking Questions About the Data Sets

Find descriptive statistics and familiarise yourself with the data:
- Number of samples in data set
- Number of columns
- Number of rows
- Are there duplicates?
- What are the Datatypes?
- Identify features with Missing Data
- Number of non-null unique values for each feature
- What the unique values are and the counts for each

For multiple data sets, what are the variables, and do they match?
- Are they formatted the same?
- Are they labelled the same?

In [None]:
# import packages and data

import pandas as pd
import numpy as np

# import data set
df = pd.read_csv('file_name.csv')

# check the head
df.head()


In [None]:
# find basic info about the data
df.info()

# find the row and col count
df.shape

# find descriptive stats
df.describe()

# check value counts for column (good for categorical data)
df['col_name'].value_counts()
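
The checklist above also asks for the number of unique values and the missing data per feature. A minimal sketch of those two checks, using a small made-up DataFrame (`toy` and its column names are assumptions for illustration, not part of the original data):

```python
import pandas as pd

# hypothetical toy data, not from the original data set
toy = pd.DataFrame({
    'fuel': ['gas', 'gas', 'diesel', None],
    'cyl': [4, 6, 4, 8],
})

# number of non-null unique values for each feature
unique_counts = toy.nunique()

# number of missing values for each feature
missing_counts = toy.isnull().sum()

# the unique values of one feature and the count for each
fuel_counts = toy['fuel'].value_counts()

print(unique_counts)
print(missing_counts)
print(fuel_counts)
```

By default, both nunique() and value_counts() leave missing values out of their results.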

## Check and Drop Duplicates

In [None]:
# find duplicates
df.duplicated().any()

# find the number of duplicates
df.duplicated().sum()

# drop duplicates
df.drop_duplicates(inplace=True)

# recheck for duplicates
print(df.duplicated().any())

# recheck the row and column counts now that duplicates are removed
df.shape
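
To see what these calls return, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). It also shows that drop_duplicates() without inplace=True returns a new DataFrame and leaves the original alone:

```python
import pandas as pd

# hypothetical toy data: the second row repeats the first
toy = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# duplicated() marks rows that are identical to an earlier row
n_dupes = toy.duplicated().sum()

# without inplace=True a new DataFrame is returned and toy is unchanged
deduped = toy.drop_duplicates()

print(n_dupes)
print(deduped.shape)
```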

## Count then Drop NA values

In [None]:
# Check if there are any NAs - returns a single bool
df.isnull().sum().any()

# Count the number of NAs
df.isnull().sum()


In [None]:
# Drop rows that contain NAs (reassign, because dropna returns a new DataFrame)

df = df.dropna()

In [2]:
# Identify rows with null values

null_data = df[df.isnull().any(axis=1)]

# count() then gives the number of non-null values per column within those rows
null_data.count()
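
A minimal sketch of the NA steps on made-up data (the `toy` frame is an assumption for illustration). It also uses the `subset` argument of dropna(), which limits the check to the listed columns when you only care about missing values in certain features:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value in each column
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

# rows that have at least one missing value
rows_with_na = toy[toy.isnull().any(axis=1)]

# drop every row that has a missing value in any column
dropped_all = toy.dropna()

# drop only the rows where column 'a' is missing
dropped_a = toy.dropna(subset=['a'])

print(len(rows_with_na), len(dropped_all), len(dropped_a))
```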

In pandas, when you check data types, you will often see columns listed as 'object'. This is typical of columns that hold strings or a mix of types.

df.dtypes will show this.

We can check the actual underlying type of the values by looking at a single element:


In [None]:

# .iloc[0] takes the first row by position, so it works whatever the index labels are
type(df['col_name'].iloc[0])
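
Checking one element only tells you about that element. As a rough sketch, assuming a made-up column that deliberately mixes strings and integers, you can count the Python types of all the values in an object column:

```python
import pandas as pd

# hypothetical toy column mixing str and int on purpose
toy = pd.DataFrame({'col_name': ['4', '6', 8]})

# the column is reported as object
print(toy.dtypes)

# type of the first element only
first_type = type(toy['col_name'].iloc[0])

# name of the Python type of every element, with counts
type_counts = toy['col_name'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```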


# Cleaning Column Labels

1. Drop extraneous columns
Drop features that aren't consistent (not present in both datasets) or aren't relevant to our questions. Use Pandas' drop function.

In [None]:
# Using .iloc

# keep columns 0-1 and 22-31 by position (np.r_ joins the two ranges)
df_new = df.iloc[:, np.r_[0:2, 22:32]]

# view the first few rows to confirm this was successful
df_new.head()

## Drop Columns

In [None]:
# using the pandas .drop function
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

# drop columns - inplace=True will update the DF you are working on 
df.drop(['B', 'C'], axis=1, inplace=True)


# or drop rows by index label (returns a new DataFrame; df itself is unchanged)
df.drop([0, 1])
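
The same drops can be written with the `columns` and `index` keywords, which some find easier to read than `axis`. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# equivalent to toy.drop(['B', 'C'], axis=1)
no_bc = toy.drop(columns=['B', 'C'])

# equivalent to toy.drop([0, 1])
no_first_rows = toy.drop(index=[0, 1])

# neither call used inplace=True, so toy still has every row and column
print(no_bc.columns.tolist(), no_first_rows.index.tolist(), toy.shape)
```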

## Rename Columns


In [None]:
# assign new labels to columns in dataframe (one string per column, in order)
new_labels = ['a', 'b', 'c', 'd']

df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()

# replace spaces with underscores and lowercase labels for 2008 dataset
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)


In [None]:
# confirm column labels for 2008 and 2018 datasets are identical
df_1.columns == df_2.columns

In [None]:
# make sure they're all identical like this
(df_1.columns == df_2.columns).all()
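
The elementwise == comparison needs both frames to have the same number of columns. When they may differ, a sketch along these lines can report which labels are missing from each side (the frames `df_a` and `df_b` and their column names are made up for illustration):

```python
import pandas as pd

# hypothetical frames with partly different labels
df_a = pd.DataFrame(columns=['model', 'cyl', 'sales_area'])
df_b = pd.DataFrame(columns=['model', 'cyl', 'cert_region', 'stnd'])

# True only if both have the same labels in the same order
same_labels = df_a.columns.equals(df_b.columns)

# labels that appear in one frame but not the other
only_in_a = df_a.columns.difference(df_b.columns)
only_in_b = df_b.columns.difference(df_a.columns)

print(same_labels)
print(list(only_in_a), list(only_in_b))
```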

## Fixing Data Types

Extract the number from a string and convert it to an integer

In [None]:
# Extract int from strings in x column
# (raw string for the regex; expand=False returns a Series rather than a DataFrame)
df['x'] = df['x'].str.extract(r'(\d+)', expand=False).astype(int)

df.head(1)

In [None]:
# convert floats in y column to int
df['y'] =  df['y'].astype(int)
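
A small sketch of both conversions on made-up values (the `toy` frame and its contents are assumptions for illustration). Note that astype(int) on floats truncates the decimal part rather than rounding:

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'x': ['(6 cyl)', '(4 cyl)', '(12 cyl)'],
    'y': [1.0, 2.9, 3.5],
})

# pull the first run of digits out of each string, then convert to int
toy['x'] = toy['x'].str.extract(r'(\d+)', expand=False).astype(int)

# float to int drops everything after the decimal point
toy['y'] = toy['y'].astype(int)

print(toy)
```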

## Working with datetime objects

**Converting a column into a pandas datetime object**

https://pandas.pydata.org/pandas-docs/stable/api.html#datetimeindex


In [None]:
df.date_column = pd.to_datetime(df.date_column)

# or

df['date_column'] = pd.to_datetime(df['date_column'], format='%Y%m%d')
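
As a quick check that the conversion worked, here is a sketch on made-up date strings in the %Y%m%d layout used above (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# hypothetical toy data: dates stored as compact strings
toy = pd.DataFrame({'date_column': ['20180101', '20180102', '20180107']})

# parse the strings using an explicit format
toy['date_column'] = pd.to_datetime(toy['date_column'], format='%Y%m%d')

# the dtype is now a datetime64 type and the .dt accessor is available
print(toy['date_column'].dtype)
days = toy['date_column'].dt.day
print(days.tolist())
```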

In [None]:
# from a datetime column, return the weekday (Monday=0 ... Sunday=6) using the pandas .dt accessor
df.date_column.dt.weekday

# that used dot notation; bracket notation with the column name also works
df['date_column'].dt.weekday

# return the month
df.date_column.dt.month

This is useful because we can now add the numerical representation of the weekday or the month as a new column in our data frame.

In [None]:
df['weekdays'] = df['date_column'].dt.weekday

# or
# .assign returns a copy of the df with the new column added

df = df.assign(weekdays=df['date_column'].dt.weekday)

We don't want the numerical value; we want the actual names of the weekdays. We can get them by creating a dictionary of the weekday names and then using a function along with the .apply() method to assign them to our dataframe.

In [None]:
weekday_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday']
weekday_dict = {key:weekday_names[key] for key in range(7)}

# write a function to return the day of the week based on an integer input

def day_of_week(day_number):
    return weekday_dict[day_number]

# use .apply() to apply a custom function to the weekdays column

df['weekdays'] = df['weekdays'].apply(day_of_week)

You can also do this by setting the datetime column as the index and calling day_name() on it. (Older pandas versions exposed this as the weekday_name attribute, which has since been removed.)

see this at 1:22:30
https://www.youtube.com/watch?v=5XGycFIe8qE&feature=youtu.be

In [None]:
df.index.day_name()

# this will return the weekday names from an index that is a datetime object
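
In recent pandas versions the name lookup is done with day_name(), which is available both on a datetime index and through the .dt accessor on a column. A minimal sketch on made-up dates (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'date_column': pd.to_datetime(['2018-01-01', '2018-01-06', '2018-01-07']),
    'value': [10, 20, 30],
})

# weekday names straight from the column
names_from_column = toy['date_column'].dt.day_name()

# or set the datetime column as the index and ask the index
indexed = toy.set_index('date_column')
names_from_index = indexed.index.day_name()

print(names_from_column.tolist())
print(list(names_from_index))
```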


## Sort by Index

In [None]:
# we want to sort our data frame based on the ordered list of weekday names we created earlier (above)

# Need to use .loc to sort by index

# group by weekdays
weekday_counts = df.groupby('weekdays').count()

# reorder the dataframe
weekday_counts = weekday_counts.loc[weekday_names]
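
Selecting with .loc and a list of labels expects every label to exist in the index, and recent pandas versions raise a KeyError if a weekday has no rows. As an alternative sketch, reindex() applies the same ordering and fills absent days with NaN (the `toy` frame is made up for illustration):

```python
import pandas as pd

weekday_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# hypothetical toy data that only covers three weekdays
toy = pd.DataFrame({
    'weekdays': ['Sunday', 'Monday', 'Monday', 'Friday'],
    'value': [1, 2, 3, 4],
})

# groupby sorts the index alphabetically: Friday, Monday, Sunday
weekday_counts = toy.groupby('weekdays').count()

# reorder to calendar order; weekdays with no rows become NaN
ordered = weekday_counts.reindex(weekday_names)

# optionally keep only the weekdays that actually occur
present = ordered.dropna()

print(ordered)
```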

### convert string to float


In [None]:
# convert strings to floats
columns = ['col_1', 'col_2', 'col_3']

for c in columns:
    df1[c] = df1[c].astype(float)
    df2[c] = df2[c].astype(float)


### convert int to float

In [None]:
# using astype()
df['col'] = df['col'].astype(float)

# this will raise an error if a value cannot be parsed as a number, for example the string '6/4'

# First, search for all of the rows with /
n_df = df[df['col'].str.contains('/')]
n_df
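
Searching for '/' only finds one kind of bad value. As a more general sketch, pd.to_numeric with errors='coerce' turns anything it cannot parse into NaN, which can then be used to list the problem rows (the `toy` frame is made up for illustration):

```python
import pandas as pd

# hypothetical toy data: one value cannot be parsed as a number
toy = pd.DataFrame({'col': ['21', '6/4', '18.5']})

# values that cannot be converted become NaN instead of raising an error
as_numbers = pd.to_numeric(toy['col'], errors='coerce')

# rows of the original data whose value failed to convert
problem_rows = toy[as_numbers.isnull()]

print(problem_rows)
```

How to treat those rows (drop them, split them, or fix them by hand) depends on what the values mean in your data set.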

## check conversions of two data sets

In [None]:
df1.dtypes == df2.dtypes
# returns a boolean Series, indexed by column name, that is True where the data types match

## Save Your new Data Frame(s)

In [None]:
# save new datasets
df.to_csv('data.csv', index=False)
