## Intro to using Pandas - with healthcare for all data

NumPy is a Python library used for working with arrays.

It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.

NumPy stands for Numerical Python.

In [None]:
#step 1 import your libraries / packages - try running, if not installed use conda / pip to install 
# eg conda install -c anaconda numpy or pip install numpy 
import numpy as np

In [None]:
#should be included with anaconda installation 
import pandas as pd

### get data 

In [None]:
#bring in the file and convert to a pandas dataframe  
file1 = pd.read_csv(<"insert file name or file path">)

In [None]:
#look at the top rows of the first file using df.head()


In [None]:
#use describe to review whats in the columns and some basic descriptive statistics
file1.describe(include = "all")

In [None]:
#bring in the next file 
file2 = pd.read_csv('file2.txt',sep='\t')
# note: sep='\t' is needed because this file is tab-delimited; read_csv assumes a comma separator by default
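
As a minimal sketch of why the separator matters, the example below parses the same small table from comma-separated and tab-separated text. The in-memory strings and column names are made up for illustration and stand in for real files on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory "files"; real data would be read from disk.
csv_text = "id,state\n1,FL\n2,CA\n"
tsv_text = "id\tstate\n1\tFL\n2\tCA\n"

from_csv = pd.read_csv(io.StringIO(csv_text))            # default sep=","
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-delimited
wrong = pd.read_csv(io.StringIO(tsv_text))               # separator omitted

print(from_tsv.shape)  # (2, 2): two rows, two columns
print(wrong.shape)     # (2, 1): each line lands in a single column
```

Checking the shape straight after reading is a quick way to spot a wrong separator.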

In [None]:
#look at the top rows of the second file using df.head()


In [None]:
#look at the shape of the second file using df.shape


In [None]:
#read in and review the head/ shape of the remaining two excel files - this time using pd.read_excel, as file3 and file4


### merge the data frames 

After reviewing the column headers for a match, we will combine data sources 1 and 2.

In [None]:
#lets check the column names for file 1 and 2 using df.columns 


In [None]:
# pull out the column names from file1 as a variable

column_names=file1.columns

In [None]:
#this command will set those column names in the target data frame 

data=pd.DataFrame(columns=column_names)

In [None]:
#use head() to review what we have 


In [None]:
# next lets concatenate our new target data frame with our first file1
data=pd.concat([data,file1],axis=0)

In [None]:
#use head() to review what we have 


In [None]:
# same again, concat the file 2 into the data df. 


#hint : be careful not to run this more than once or you will have to clear some output! 

In [None]:
#check the shape to ensure you have the correct no of rows
data.shape

In [None]:
#before bringing in the other dfs lets confirm the data columns will line up for all files  

data.columns

#if they dont - what can be done? 
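
One possible answer, sketched on made-up frames: rename the mismatched headers so they agree before concatenating. The frames `a` and `b` and the header `ST` are hypothetical; the real files may differ in other ways.

```python
import pandas as pd

# Two toy frames whose headers do not line up ("state" vs "ST").
a = pd.DataFrame({"state": ["FL"], "gender": ["M"]})
b = pd.DataFrame({"ST": ["CA"], "gender": ["F"]})

# Concatenating as-is keeps both spellings and fills the gaps with NaN.
misaligned = pd.concat([a, b], axis=0)

# Renaming first makes the columns match, so the rows stack cleanly.
b_fixed = b.rename(columns={"ST": "state"})
aligned = pd.concat([a, b_fixed], axis=0, ignore_index=True)

print(misaligned.shape)  # (2, 3)
print(aligned.shape)     # (2, 2)
```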

In [None]:
# if happy to proceed, lets concat in file 3 using the same method as before and review the head()



In [None]:
#lets concat in file 4 using the same method as before and review the head()



hint: we are doing this step by step. There are faster methods that concat all files at once.
     
     starting from the original files (so that no rows are added twice) we could use something like:  
data = pd.concat([file1, file2, file3, file4], axis=0)
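
A small sketch of the one-step approach, using four made-up single-row frames in place of the real files. The `ignore_index=True` argument is an addition here; it renumbers the rows so the combined frame has a clean 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-ins for file1..file4, one row each.
states = ["FL", "CA", "TX", "NY"]
frames = [pd.DataFrame({"id": [i], "state": [s]})
          for i, s in enumerate(states, start=1)]

# Stack all frames in one call and renumber the rows.
combined = pd.concat(frames, axis=0, ignore_index=True)

# The row count should equal the sum of the individual row counts.
expected_rows = sum(len(f) for f in frames)
print(combined.shape, expected_rows)
```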


In [None]:
#check the shape to ensure you have the correct no of rows and columns


### Standardizing header names

Some standards are:
    use lower case
    if headers have spaces, replace them with underscores 

In [None]:
#lets look at one column header (using the index position)
data.columns[1]

In [None]:
#how many columns do we have?
len(data.columns)

In [None]:
#we want to make all the columns into lower case 
cols = []
for i in range(len(data.columns)):
    cols.append(data.columns[i].lower())
cols

In [None]:
#reset the columns using the list we just built

data.columns = cols

#hint: we could also have used a STR function data.columns=data.columns.str.lower()
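
Both standards can be applied in one line with the string accessor. This is a sketch on an empty frame with made-up headers, not the real column names.

```python
import pandas as pd

# Hypothetical headers with mixed case and spaces.
df = pd.DataFrame(columns=["Control N", "STATE", "Avg Gift"])

# Lower-case every header, then swap spaces for underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(df.columns))  # ['control_n', 'state', 'avg_gift']
```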

In [None]:
#use head() to check the df 



### examining the columns and looking for empty cells

In [None]:
#check the data types of all columns 
data.dtypes

In [None]:
#lets look at how many non-null values we have in each column (info() reports non-null counts; data.isna().sum() gives the NA counts directly)
data.info()


In [None]:
#we can easily create a df of missing values using isna()

missingdata=data.isna()
missingdata.head()

In [None]:
#what about showing the % nulls - heres one technique
#the above data frame is boolean ie True/False, which sum as 1s and 0s - so if we add them up we get a count of missing values 

missingdata.sum()/len(data)

#what does this look like from maths? 
#Summing up all the values in a column and then dividing by the total number is the mean.
#this is the same as missingdata.mean()
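
To check that equivalence on something small, here is a sketch with a made-up four-row frame; the column names `a` and `b` are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" has 2 of 4 values missing, column "b" has 1 of 4.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

missing = df.isna()                # True where a value is missing
by_sum = missing.sum() / len(df)   # count of missing divided by row count
by_mean = missing.mean()           # the same fraction in one step
pct = (by_mean * 100).round(2)     # expressed as a percentage

print(pct)
```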

In [None]:
#to summarise this in one line of code and round the values 
data.isna().mean().round(4) *100

In [None]:
# lets assume we wanted to drop a single column due to poor coverage across all data sets: domain 

data = data.drop(['domain'],axis=1)

In [None]:
#what we are left with 
data.columns

In [None]:
#this time lets drop a few columns in one hit by choosing which we want to keep - also rearranging the columns

data=data[['controln', 'state', 'gender', 'hv1', 'hvp1',
       'pobc1', 'pobc2','avggift', 'target_d','ic1','ic2','ic3','ic4','ic5']]

In [None]:
#lets rename a few of the columns to sensible names
data = data.rename(columns={ 'controln':'id','hv1':'median_home_val', 'ic1':'median_household_income'})

In [None]:
#review the data using head()


### filtering and subsetting 

In [None]:
#method 1 focus on just male donors in florida 

data[(data["state"]=="FL") & (data["gender"]=='M')]

In [None]:
#alternative method : 

data.query('gender=="M" & state=="FL" ')

In [None]:
# last option: loc with a boolean mask (filtering on gender only here)
data.loc[data.gender == "M"]
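
The three styles are interchangeable for a combined condition. The sketch below applies the same two-part filter with each of them on a made-up frame and confirms that the results match.

```python
import pandas as pd

# Toy data with the same column names used in this notebook.
df = pd.DataFrame({
    "state": ["FL", "FL", "CA", "FL"],
    "gender": ["M", "F", "M", "M"],
    "avggift": [5.0, 12.0, 20.0, 15.0],
})

# Boolean mask: each condition needs its own parentheses because & binds
# more tightly than ==.
mask_result = df[(df["state"] == "FL") & (df["gender"] == "M")]

# query: the condition is written as a string.
query_result = df.query('gender == "M" and state == "FL"')

# loc: takes the same boolean mask as the first method.
loc_result = df.loc[(df["gender"] == "M") & (df["state"] == "FL")]

print(list(mask_result.index))  # [0, 3]
```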

quick question - are we picking up all males in our data set? 

In [None]:
data['gender'].value_counts()

In [None]:
data['gender'].unique()

In [None]:
#challenge1: view a filtered subset of the data where the average gift is over 10 dollars 



In [None]:
#challenge 2: view a filtered subset of the data where the gender is M, state is florida and avg gift size is more than 10 dollars



In [None]:
#before we apply any lasting filter to our data frame lets create a new version which we can play with
# so that the original data frame wont be affected

tempdata=data.copy()
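
A quick sketch of what the copy buys us, using a made-up two-row frame: changes to the copy do not reach the original.

```python
import pandas as pd

original = pd.DataFrame({"avggift": [5.0, 12.0]})

# copy() creates an independent frame with its own data.
snapshot = original.copy()
snapshot.loc[0, "avggift"] = 99.0

print(original.loc[0, "avggift"])  # 5.0, the original is untouched
print(snapshot.loc[0, "avggift"])  # 99.0
```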

In [None]:
#use head(), .columns or .shape to review the tempdata



In [None]:
#create a filtered dataframe from our M gender subset 
filtered =data[data['gender']=='M']

In [None]:
#use shape for your filtered df 


In [None]:
#use head to review the top rows of the filtered df


In [None]:
#notice we need to reset the index (with inplace=True the method returns None, so we assign the result instead)
filtered = filtered.reset_index(drop=True)
filtered.head()
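
Why the reset is needed, sketched on a made-up frame: filtering keeps the original row labels, so the index of the subset has gaps until it is renumbered.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "avggift": [5.0, 12.0, 20.0, 15.0],
})

males = df[df["gender"] == "M"]
before = list(males.index)          # labels carried over from df

# drop=True discards the old labels instead of keeping them as a column.
males = males.reset_index(drop=True)
after = list(males.index)

print(before)  # [1, 3]
print(after)   # [0, 1]
```

Note that `reset_index(..., inplace=True)` returns None, so its result should not be assigned back to the variable.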

In [None]:
#use the index to display the first ten rows 
filtered[0:10]

In [None]:
# select three columns by name, then use the index to display the first ten rows 
filtered[['gender', 'ic2', 'ic3']][0:10]

In [None]:
#use the index to display rows with index number 1 and 2 (remember an index starts at 0)


In [None]:
#use loc to return a selected row 
filtered.loc[1]

In [None]:
#use iloc to return first 10 rows and the first 4 columns 

filtered.iloc[0:10,0:4]

loc vs iloc 
loc is label based (slices include the end label)
iloc is integer position based only (slices exclude the end position)
https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/
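
The difference is easiest to see when the index labels are not 0, 1, 2, and so on. The sketch below uses a made-up frame with labels 10 to 40.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["FL", "CA", "TX", "NY"], "avggift": [5.0, 12.0, 20.0, 15.0]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

by_label = df.loc[20]      # the row whose index label is 20
by_position = df.iloc[1]   # the second row, whatever its label

label_slice = df.loc[10:30]  # label slice includes the end label
pos_slice = df.iloc[0:2]     # position slice excludes the end position

print(len(label_slice), len(pos_slice))  # 3 2
```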

In [None]:
# tip : its possible to set the index from any column eg id (the column originally named CONTROLN, renamed earlier)

filtered2=filtered.reset_index(drop=True).set_index('id')
filtered2.head()
#in which case using iloc and loc would give very different results


In [None]:
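#sort by the index and keep the result so that label slices with loc behave predictably
filtered2=filtered2.sort_index()
filtered2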

In [None]:
#when reviewing subsets it is often smart to reconfigure how many rows you will see max

pd.set_option('display.max_rows', 20)

In [None]:
filtered2.loc[:21885]

In [None]:
filtered2.iloc[:21885]

### data cleaning steps (1) data type change 

In [None]:
#reverting back to our temp data copy
# lets look at the data types. any data type worth changing? 
tempdata.dtypes


In [None]:
#focus on float data types 
tempdata.select_dtypes('float64')

In [None]:
#simple change from float to int type would drop the decimals (astype('int') raises an error if the column contains NaN)
tempdata['avggift'] = tempdata['avggift'].astype('int')
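
A short sketch of what "drop the decimals" means in practice, on made-up gift amounts: the cast truncates rather than rounds, so rounding first gives a different result.

```python
import pandas as pd

gifts = pd.Series([10.9, 7.2, 15.6])

# astype('int') truncates toward zero.
as_int = gifts.astype("int")

# Rounding before the cast keeps the nearest whole number instead.
rounded = gifts.round().astype("int")

print(as_int.tolist())   # [10, 7, 15]
print(rounded.tolist())  # [11, 7, 16]
```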


In [None]:
#focus on the object data types 
tempdata.select_dtypes('object')

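Although it is tempting to force an object column into a float using astype:

tempdata['median_home_val'] = tempdata['median_home_val'].astype('float', errors='ignore')

we know some values of the data are strings that cannot be parsed as numbers. Without errors='ignore' this raises an error; with it, the error is suppressed and the column is returned unchanged as object.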

In [None]:
#lets have a look at all the numeric data 

tempdata.select_dtypes('number')

In [None]:
#do a data type change for the column and replace non-numeric values with NaN

tempdata['median_home_val'] =  pd.to_numeric(tempdata['median_home_val'], errors='coerce')
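
To see what errors='coerce' does, here is a sketch on a made-up object column that mixes numeric strings, a non-numeric string and a missing value.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a text file.
raw = pd.Series(["1200", "850.5", "unknown", None], dtype="object")

# Values that cannot be parsed become NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())           # [1200.0, 850.5, nan, nan]
print(int(clean.isna().sum()))  # 2
```

Comparing the NaN count before and after the conversion shows how many values were coerced.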

In [None]:
# do the same for median household income, ic3 and ic5
