# Module 10: Descriptive Statistics and Outliers
***

## What are statistics?

Statistics deals with collecting data, interpreting those data, and drawing conclusions from them. Statistics are used across many disciplines and inform business and policy decisions.

## Two types of statistics
***
#### Descriptive Statistics
Descriptive statistics use numeric and graphical techniques to summarize the characteristics of a set of data, identify patterns within the dataset, and present the information in a meaningful (easy to interpret) way.

#### Inferential Statistics
Inferential statistics use samples of data to make generalizations about a larger set of data (or a larger population). They use patterns and trends identified in the sample to make estimates, decisions, and predictions that reach beyond your immediate set of data.

## Import Libraries and Data

In [54]:
import pandas as pd
import numpy as np
import scipy.stats as stats ## new library alert! ##

## SciPy is used for scientific and mathematical computing; its stats module provides
# statistical functions, probability distributions, and statistical tests

df = pd.read_excel("axisdata.xlsx")

df.head()

Unnamed: 0,Fname,Lname,Position,Gender,Hours Worked,SalesTraining,Years Experience,Cars Sold
0,Jackie,Jackson,Trainee,F,32,Y,1,3
1,Mary,Patterson,Trainee,F,43,N,1,6
2,Tanya,Adams,Trainee,F,28,Y,1,2
3,Tanya,Henderson,Trainee,F,24,Y,1,5
4,Walter,Franklin,Trainee,M,25,Y,1,4


## Qualitative vs Quantitative Data
***

#### Quantitative Data
Data that are recorded on a numeric scale (e.g. age, weight, salary, number of empty seats on a plane, credit score). Quantitative variables are also called <b>numeric</b> variables.

#### Qualitative Data
Data that cannot be measured on a numeric scale (e.g. eye color, type of car, blood type, race). Each observation can only be classified into one of a set of categories. Qualitative variables are also called <b>categorical</b> variables. Qualitative data also includes text and string data.

In [55]:
## Which variables are qualitative and which variables are quantitative?

df.head()

Unnamed: 0,Fname,Lname,Position,Gender,Hours Worked,SalesTraining,Years Experience,Cars Sold
0,Jackie,Jackson,Trainee,F,32,Y,1,3
1,Mary,Patterson,Trainee,F,43,N,1,6
2,Tanya,Adams,Trainee,F,28,Y,1,2
3,Tanya,Henderson,Trainee,F,24,Y,1,5
4,Walter,Franklin,Trainee,M,25,Y,1,4


## Describing Qualitative Data
***

Qualitative data is best summarized by frequency tables. 

Frequency tables summarize categorical data into a table showing each category, the number of observations that fall within each category, and the relative frequency of each category.

Relative frequency is the number of observations in a category divided by the total number of observations, often expressed as the percentage of the total sample that falls within that category. Both representations of categorical data give insight into how common specific categories are within a set of data.

<b>Bar charts</b> are ideal for graphically representing qualitative data as they visually show the frequency of observations across various groups. 

#### Note: the graphic below should read N = 150

<img src="FreqTable.png">

In [56]:
## Frequencies 

df["Gender"].value_counts()

M    510
F    489
Name: Gender, dtype: int64

In [57]:
## Relative Frequencies 

df["Gender"].value_counts(normalize=True)

M    0.510511
F    0.489489
Name: Gender, dtype: float64
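
A frequency table that shows counts and relative frequencies side by side can be sketched as follows. The `training` series and its values are hypothetical and are not taken from axisdata.xlsx; this is one possible layout, not the only one.

```python
import pandas as pd

# Hypothetical sample: the training status of 8 employees
training = pd.Series(["Y", "N", "Y", "Y", "N", "Y", "Y", "N"], name="SalesTraining")

# Build a frequency table with counts and relative frequencies side by side
freq_table = pd.DataFrame({
    "count": training.value_counts(),
    "relative_frequency": training.value_counts(normalize=True),
})

print(freq_table)
```

Because both columns come from `value_counts()`, they share the same category index and line up row by row.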

In [61]:
## Multi-variable Frequencies 

pd.crosstab(df["Gender"], df["SalesTraining"], margins=False, normalize=False)

Gender \ SalesTraining,N,Y
F,212,277
M,201,309


# { Exercise 1 }

    1. Import the "pokemon.csv" file; name the dataset 'poke'. Preview the first 5 rows. 
    2. Determine the frequencies of unique pokemon types (variable name: Type 1) in this dataset
    3. Determine the relative frequencies of the unique pokemon types (variable name: Type 1) in this dataset
    4. Create a crosstabs table that shows the frequency of pokemon stage (variable name: Stage) and legendary status (variable name: Legendary). 

In [6]:
## always start with importing the libraries you need! ## 

import pandas as pd

## new library ##

import numpy as np

In [67]:
## similar process to reading an Excel file, but CSV files use pd.read_csv

poke = pd.read_csv("pokemon.csv")

poke.head()

Unnamed: 0,Num,Name,Type 1,Type 2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed,Stage,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,2,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,3,False
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,2,False


In [8]:
poke["Type 1"].value_counts()

Water       28
Normal      22
Poison      14
Grass       12
Fire        12
Bug         12
Electric     9
Rock         9
Ground       8
Psychic      8
Fighting     7
Ghost        3
Dragon       3
Fairy        2
Ice          2
Name: Type 1, dtype: int64

In [9]:
poke["Type 1"].value_counts(normalize = True)

Water       0.185430
Normal      0.145695
Poison      0.092715
Grass       0.079470
Fire        0.079470
Bug         0.079470
Electric    0.059603
Rock        0.059603
Ground      0.052980
Psychic     0.052980
Fighting    0.046358
Ghost       0.019868
Dragon      0.019868
Fairy       0.013245
Ice         0.013245
Name: Type 1, dtype: float64

In [10]:
pd.crosstab(poke["Legendary"], poke["Stage"], margins = True, normalize = False)

Legendary \ Stage,1,2,3,All
False,75,56,16,147
True,4,0,0,4
All,79,56,16,151


## Describing Quantitative Data

There are many more techniques for describing quantitative data than for describing qualitative data. When summarizing quantitative data, the objective is to get a sense of where the values cluster and how much they vary.

***

### Measures of Central Tendency

Measures of Central Tendency are estimates of what a "typical" value/data point is within your set of data. A measure of central tendency is a number around which an entire sample of data is spread. We can describe this central position using several statistics, including the mean (average), the median, and the mode. 

Measures of central tendency can be visualized with a frequency distribution (histogram). A frequency distribution describes the values in your dataset and how often each value occurs. When you graph a frequency distribution, its shape gives insight into which values occurred most and least often.

<img src="NormalDist.jpg">

***

#### Mean

The mean, which is the most commonly known and used measure of central tendency, is the arithmetic average of a set of numeric values. The mean is calculated by adding all the values together and dividing by the total number of values. In symmetric data with a single peak, the mean sits at the center of the frequency distribution (the top of the hump).

In [11]:
## Calculate the Mean of a column 

df["Hours Worked"].mean()

33.727727727727725

#### Median 

The median is the value that divides the data into two equal parts when the data is arranged from least to greatest. The median is the middle value if the number of values is odd, and the average of the two middle values if the number of values is even.

In [62]:
## Calculate the Median of a column 

df["Hours Worked"].median()

34.0
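
To make the odd and even cases concrete, here is a minimal sketch using Python's built-in `statistics` module. The two lists are made-up values chosen only for illustration.

```python
import statistics

odd_values = [7, 1, 5, 3, 9]         # 5 values; sorted: 1, 3, 5, 7, 9
even_values = [7, 1, 5, 3, 9, 11]    # 6 values; sorted: 1, 3, 5, 7, 9, 11

median_odd = statistics.median(odd_values)     # the single middle value
median_even = statistics.median(even_values)   # the average of the two middle values (5 and 7)

print(median_odd, median_even)
```

The values do not need to be sorted beforehand; `statistics.median` (like the pandas `median()` method) orders them internally.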

#### Mode 

The mode is the value that appears most frequently in a sample of data, or the value that has the highest frequency. It is possible for a set of data to have no mode, and it is possible for a set of data to have multiple modes. 

In [13]:
## Calculate the Mode of a column 

df["Hours Worked"].mode()

0    38
Name: Hours Worked, dtype: int64
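
The pandas `mode()` method returns a Series rather than a single number because a column can have more than one mode. A small sketch with hypothetical values shows the three situations described above.

```python
import pandas as pd

one_mode = pd.Series([2, 3, 3, 4, 5])     # 3 appears most often
two_modes = pd.Series([2, 2, 3, 4, 4])    # 2 and 4 tie for most frequent
no_repeats = pd.Series([1, 2, 3, 4])      # every value appears exactly once

modes_one = one_mode.mode().tolist()
modes_two = two_modes.mode().tolist()
modes_none = no_repeats.mode().tolist()

print(modes_one)
print(modes_two)
print(modes_none)   # pandas returns every value here, because they all tie
```

When no value repeats, pandas returns all of the values; in that situation it is usually more informative to report that the data has no mode.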

# { Exercise 2 }

    1. Using the pokemon dataset ('poke'), calculate the mean attack value (Attack) in the dataset.  
    2. Determine the median defense value (Defense). Write a single sentence on what your findings mean. 
    3. Determine the mode of the speed (Speed) column. Write a single sentence on what your findings mean. 

In [68]:
## the average value in the Attack Column

poke["Attack"].mean()

72.54966887417218

In [64]:
## the middle value of the Defense column when the values are sorted

poke["Defense"].median()

65.0

In [65]:
## the most frequent values in the Speed column
poke["Speed"].mode()

0    70
1    90
Name: Speed, dtype: int64

### Data Distribution and Measurement Sensitivity
***

### Normally distributed data vs. non-normally distributed data

Several statistics are suited for data that follows a <b>normal distribution</b>. When visualized, normally distributed data lies in a symmetrical pattern (bell-shape) where the majority of the data surrounds the mean with decreasing amounts evenly distributed to the left and the right. 

While this is the ideal, real-world data is not always normally distributed. Several statistics can still be used with non-normally distributed data, so it is important to determine how your data is distributed before choosing which statistics to use.

#### Distribution and Measure of Central Tendency

The shape of the data helps to determine which measure of central tendency should be reported. Some statistics are very sensitive to non-normal distributions, while others are more resilient. Three of the most common shapes to learn are symmetric, left-skewed, and right-skewed. <b>Skew</b> is a measure of the asymmetry of the distribution. <b>In the images below, values increase from left to right, although the axis is not shown.</b>

#### Symmetric Data
* mean, median, and mode are all equal (for a symmetric distribution with a single peak)
* no skewness is apparent
* the distribution is described as symmetric

<img src="Symmetrical.jpg">

***

#### Left-skewed Data
* mean < median
* long tail on the left

<img src="LeftSkew.jpg">

***

#### Right-skewed Data
* mean > median
* long tail on the right

<img src="RightSkew.jpg">

***

### Considerations for Shape

When your data is skewed in either direction, you should consider the best measure of central tendency to report. The median is best used when you have skewed data. This is because the median is less affected by extreme values. 
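
The effect of a single extreme value on the mean and the median can be sketched with a small, hypothetical set of salaries (in thousands). The pandas `skew()` method gives a numeric measure of asymmetry; a positive result suggests a right skew.

```python
import pandas as pd

# Hypothetical salaries (in thousands): most are similar, one is extreme
salaries = pd.Series([40, 42, 45, 47, 50, 52, 55, 300])

mean_salary = salaries.mean()       # pulled upward by the 300
median_salary = salaries.median()   # average of the two middle values, 47 and 50
skewness = salaries.skew()          # positive for a long right tail

print("Mean:  ", mean_salary)
print("Median:", median_salary)
print("Skew:  ", skewness)
```

Here the mean is larger than every value except the extreme one, while the median still sits among the typical salaries, which is why the median is usually the better summary for skewed data.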

If your data is skewed, this could be an indication that your data has extreme high or low values that are contributing to the skewness. Later in this lesson, we will talk about how to handle extreme values. 

### Measures of Variability (or Spread)

Measures of Variability, or spread, describe the amount of dispersion in your data sample. In other words, measures of variability describe how spread out the values are. For example, in a sample of 100 students, the average test score is 75. However, not all students will have gotten a 75: their scores will be spread out above and below the average, some higher and some lower. To describe the spread of our data, we can use several statistics, including range, standard deviation, percentiles, and quartiles.

***

#### Range

The range is the difference between the lowest and highest value in a set of data. For example, if the maximum value is 10 and the minimum value is 2, the range is 8. The larger the range, the larger the difference between the highest and lowest values. 

In [14]:
## Calculate the range of values in a column

hrs_range = df['Hours Worked'].max() - df['Hours Worked'].min()

print(hrs_range)

28


#### Standard Deviation (SD)

The SD measures roughly how far, on average, each data value lies from the mean. You can consider SD as a measure of how spread out the data is from the mean. A low SD indicates that the data points are clustered around the mean, while a high SD indicates the data points are spread out over a wider range of values.

<img src="https://d20khd7ddkh5ls.cloudfront.net/high_low_standard_deviation.png">

In [15]:
## Calculate the Standard Deviation of values in a column

print("Mean hours worked by all employees:", df["Hours Worked"].mean())

df["Hours Worked"].std()

Mean hours worked by all employees: 33.727727727727725


8.223453795492198
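
To show what `std()` computes, the sketch below works through the calculation by hand on five hypothetical values. It also highlights a detail that is easy to miss: pandas divides by n - 1 by default (the sample SD, `ddof=1`), while NumPy's `np.std` divides by n by default (the population SD, `ddof=0`).

```python
import numpy as np
import pandas as pd

hours = pd.Series([20, 25, 30, 35, 40])   # hypothetical values

mean_hours = hours.mean()                      # 30.0
squared_diffs = (hours - mean_hours) ** 2      # squared distance of each value from the mean

sample_sd = np.sqrt(squared_diffs.sum() / (len(hours) - 1))   # divide by n - 1
population_sd = np.sqrt(squared_diffs.sum() / len(hours))     # divide by n

print(sample_sd, hours.std())                    # pandas default: ddof=1
print(population_sd, np.std(hours.to_numpy()))   # NumPy default: ddof=0
```

The two versions are close for large datasets, but they differ noticeably for small ones. `scipy.stats.zscore` also defaults to `ddof=0`, which matters when z-scores are compared with a pandas `std()` result.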

#### Percentiles

Percentiles describe where one data point stands relative to the other data points in the same dataset. When the data values are arranged from lowest to highest, a percentile represents the position of a value within the list of values. For example, if a value is at the 50th percentile, 50% of the data falls below the value and 50% falls above it. If a value is at the 85th percentile, 85% of the data falls below the value and 15% falls above it.

In [16]:
## Calculate percentiles of values in a column
## To determine the value that falls at a specific percentile
## np.percentile(data/column, percentile value)

np.percentile(df['Hours Worked'], 85)

## Which value falls at the 85th percentile?
## 85% of the values in Hours Worked are below this value, 15% are above this value. 

44.0
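
One way to check the interpretation of a percentile is to count the share of values that fall below it. The sketch below uses the integers 1 through 100 as made-up scores; `np.percentile` interpolates linearly between neighboring values by default, so the result is not always a value that appears in the data.

```python
import numpy as np

scores = np.arange(1, 101)   # hypothetical scores: the integers 1 through 100

p50 = np.percentile(scores, 50)
p85 = np.percentile(scores, 85)

# Share of the scores that fall below the 85th percentile value
share_below_p85 = np.mean(scores < p85)

print(p50, p85, share_below_p85)
```

The 50th percentile matches the median of the same data, and 85% of the scores fall below the 85th percentile value.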

#### Quartiles 

Quartiles are values that divide a set of data into equal quarters (when the data is ordered from lowest to highest). There are <b>three</b> quartile values that divide the data into four parts (see below). 

<img src="quar.png">

* The first quartile (Q1) is at the 25th percentile (25% below, 75% above)
* The second quartile (Q2) is at the 50th percentile (50% below, 50% above, median)
* The third quartile (Q3) is at the 75th percentile (75% below, 25% above)

<b>Quartiles can be used to calculate the Interquartile Range (IQR) - which can be used to handle outliers in your dataset.</b>

In [17]:
## Determine the three quartiles of a set of data
## Quantiles are cut points that divide a set of data into equal parts
## A 'quartile' is a special kind of quantile that divides the data into four equal parts

df['Hours Worked'].quantile(.25) 

#Q1 = .25
#Q2 = .50
#Q3 = .75

27.0
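
Passing a list to `quantile()` returns all three quartiles in one call, which makes the IQR a one-line calculation. The `hours` values below are hypothetical and only loosely modeled on the Hours Worked column.

```python
import pandas as pd

hours = pd.Series([20, 24, 27, 30, 34, 38, 41, 44, 48])   # hypothetical values

# Request Q1, Q2, and Q3 together; the result is indexed by the requested quantiles
quartiles = hours.quantile([0.25, 0.50, 0.75])
q1 = quartiles.loc[0.25]
q2 = quartiles.loc[0.50]
q3 = quartiles.loc[0.75]

iqr = q3 - q1   # interquartile range

print(quartiles)
print("IQR:", iqr)
```

Q2 is the same value that `hours.median()` returns.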

#### All of the Above 

The describe function is perfect for outputting multiple descriptive statistics at once. This will save you multiple steps if you want to look at one or several quantitative variables. 

In [18]:
## Use the describe function to view multiple statistics at the same time

df.describe()

Unnamed: 0,Hours Worked,Years Experience,Cars Sold
count,999.0,999.0,999.0
mean,33.727728,3.026026,3.922923
std,8.223454,1.394709,1.527
min,20.0,1.0,1.0
25%,27.0,2.0,3.0
50%,34.0,3.0,4.0
75%,41.0,4.0,5.0
max,48.0,5.0,7.0


In [19]:
## Use the describe function to isolate information for a single column

df['Hours Worked'].describe()

count    999.000000
mean      33.727728
std        8.223454
min       20.000000
25%       27.000000
50%       34.000000
75%       41.000000
max       48.000000
Name: Hours Worked, dtype: float64

# { Exercise 3 }

    1. Using the pokemon dataset ('poke'), calculate the range for the HP column. 
    2. What are the mean and the standard deviation for the special attack (SpAtk) column?
    3. For the Attack column, which value falls at the 80th percentile? Which value falls at the 15th percentile? For each of these values, write a single sentence describing what this means. 
    4. Calculate the second quartile (or the median) for the special defense (SpDef) column. Then, calculate the median of the same column using the method shown above. Are the values the same? 

In [69]:
poke = pd.read_csv("pokemon.csv")

poke.describe()


Unnamed: 0,Num,Total,HP,Attack,Defense,SpAtk,SpDef,Speed,Stage
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,76.0,407.07947,64.211921,72.549669,68.225166,67.139073,66.019868,68.933775,1.582781
std,43.734045,99.74384,28.590117,26.596162,26.916704,28.534199,24.197926,26.74688,0.676832
min,1.0,195.0,10.0,5.0,5.0,15.0,20.0,15.0,1.0
25%,38.5,320.0,45.0,51.0,50.0,45.0,49.0,46.5,1.0
50%,76.0,405.0,60.0,70.0,65.0,65.0,65.0,70.0,1.0
75%,113.5,490.0,80.0,90.0,84.0,87.5,80.0,90.0,2.0
max,151.0,680.0,250.0,134.0,180.0,154.0,125.0,140.0,3.0


In [21]:
HP_range = poke['HP'].max() - poke['HP'].min()

print(HP_range)

240


In [22]:
poke['SpAtk'].mean() 

67.13907284768212

In [23]:
poke['SpAtk'].std()

28.53419930191353

In [24]:
np.percentile(poke['Attack'], 80)

95.0

In [25]:
np.percentile(poke['Attack'], 15)

45.0

In [71]:
## Calculate the second quartile (Q2) of the SpDef column
## Quantiles are cut points that divide a set of data into equal parts
## A 'quartile' is a special kind of quantile that divides the data into four equal parts

poke['SpDef'].quantile(.50) 

#Q1 = .25
#Q2 = .50
#Q3 = .75

65.0

In [72]:
## the middle value of the SpDef column when the values are sorted
## Q2 (the .50 quantile) and the median are the same value for SpDef

poke["SpDef"].median()

65.0

## Handling and Removing Outliers

***

Outliers are data points that differ significantly from other data points. Outliers are an important feature of a dataset that should be identified and addressed because extreme values can influence the results of your statistics. 

   * In a sample of 5 students, the ages are 25, 22, 97, 23, 21. The value 97 is an outlier because it differs significantly from the remaining data.
   * With the outlier included, the mean is 37.6 and the median is 23
   * With the outlier removed, the mean is 22.75 and the median is 22.5
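
The numbers in this example can be reproduced with Python's built-in `statistics` module. The variable names are chosen for illustration only.

```python
import statistics

ages = [25, 22, 97, 23, 21]
ages_without_outlier = [age for age in ages if age != 97]

mean_with = statistics.mean(ages)
median_with = statistics.median(ages)
mean_without = statistics.mean(ages_without_outlier)
median_without = statistics.median(ages_without_outlier)

print("With outlier:   ", mean_with, median_with)
print("Without outlier:", mean_without, median_without)
```

Removing one value shifts the mean by almost 15 years but moves the median by only half a year, which illustrates how much more sensitive the mean is to extreme values.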

Determining which values are extreme can be challenging when you have a larger dataset. There are several methods that can be used to strategically determine which values are outliers; we will cover two separate methods below.

In [26]:
df = pd.read_csv("SalaryData.csv")
df.head()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore
0,Richard,5.1,66029,41,Female,0,Central,Y,2,98.3
1,Nick,5.9,8136300,39,Female,0,Central,Y,3,105.7
2,Morgan,9.5,116969,50,Female,1,Central,N,1,145.0
3,Susan,3.2,64445,27,Male,0,Central,Y,3,75.1
4,Matthew,8.2,113812,47,Female,1,Central,N,2,131.1


## Method 1 : Using z-scores to detect and remove outliers

In statistics, the <font color=salmon><b>Empirical Rule</b></font> estimates how data that follow a normal distribution will spread around the mean. According to this rule, nearly all data values will fall within 3 SD's above or below the mean. The Empirical Rule is also known as the <font color=salmon><b>68-95-99.7 Rule</b></font> because, according to this rule, normally distributed data will fall into the following pattern:

* 68% of the data falls within one SD of the mean

* 95% of the data falls within two SD's of the mean

* 99.7% of the data falls within three SD's of the mean

<img src="SD Bell Curve.jpeg">

#### Empirical Rule Example:

Suppose the average weight of an American man is 175 lbs with an SD of 10 lbs, and that weights are normally distributed. According to the Empirical Rule, we can assume: 

* 68% of American men are between 165 and 185 lbs (or +/- 10 lbs from the mean, or 1 SD)
* 95% of American men are between 155 and 195 lbs (or +/- 20 lbs from the mean, or 2 SD's)
* 99.7% of American men are between 145 and 205 lbs (or +/- 30 lbs from the mean, or 3 SD's)
* Typically, data values that fall outside of the 3 SD range (here, below 145 lbs or above 205 lbs) are considered outliers
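
The ranges and percentages in this example can be checked with `scipy.stats.norm`, which models a normal distribution. This is a sketch that assumes the weights are exactly normal; the mean and SD are the ones given in the example, and the `coverage` dictionary is a name introduced here for illustration.

```python
from scipy import stats

mean_weight = 175   # lbs, taken from the example above
sd_weight = 10      # lbs, taken from the example above

coverage = {}
for k in (1, 2, 3):
    lower = mean_weight - k * sd_weight
    upper = mean_weight + k * sd_weight
    # Share of a normal distribution that lies within k SDs of the mean
    coverage[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"{k} SD: {lower} to {upper} lbs covers about {coverage[k]:.1%} of values")
```

The computed shares round to the 68%, 95%, and 99.7% figures that give the rule its name.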
_______________________ 
A <font color=salmon><b>z-score</b></font>, also known as a standard score, measures the distance of a value from the mean in units of SD. In other words, the z-score tells you how many SD's a given value is from the mean. Values that fall above the mean will have a positive (+) z-score, while values that fall below the mean will have a negative (-) z-score. Values that are far from the mean can be considered outliers. A common rule is to define a value as an outlier if its z-score is greater than +3 or less than -3. 
_______________________ 

#### Steps for Detecting and Removing Outliers with z-scores
* Import the scipy library to make use of the z-score function
* Calculate the z-score for each value within a specific column in your dataset
* Transform the z-score to the absolute value of the z-score (number without the +/- sign; this will make our code simpler)
* Add the z-scores to your dataset as a new column
* Remove all rows that have an absolute z-score greater than 3
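
The steps above can be sketched on a small, self-contained example before applying them to the salary data. The `salaries` DataFrame is hypothetical, and this version filters with a boolean mask instead of storing the z-scores in a new column; either approach gives the same rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical salaries: 19 typical values plus one extreme value
salaries = pd.DataFrame({
    "Salary": [50_000 + 1_000 * i for i in range(19)] + [5_000_000]
})

# Absolute z-score for every row (scipy.stats.zscore uses the population SD, ddof=0)
z = np.abs(stats.zscore(salaries["Salary"]))

# Keep only the rows whose absolute z-score is 3 or less
cleaned = salaries[z <= 3]

print(salaries.shape, "->", cleaned.shape)
```

One caveat worth knowing: with the population SD, no z-score in a sample of n values can exceed the square root of n - 1, so a dataset with 10 or fewer rows can never produce an absolute z-score above 3. The z-score method therefore works best on larger samples.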

In [27]:
## Create a copy of your dataset to filter outliers
dfz = df.copy()

## Check original shape of the dataset
print(dfz.shape)

(30, 10)


In [28]:
## Create a new column to contain the z-scores for the employee salary
## Set the new column equal to the absolute value of the z-scores
## To calculate z-score -- stats.zscore(data/column name)
## To determine absolute value of z-score -- np.abs(data/function/options)

dfz["zscore_Salary"] = np.abs(stats.zscore(dfz["Salary"]))

## Preview new column; optional
dfz.head()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore,zscore_Salary
0,Richard,5.1,66029,41,Female,0,Central,Y,2,98.3,0.269844
1,Nick,5.9,8136300,39,Female,0,Central,Y,3,105.7,3.343557
2,Morgan,9.5,116969,50,Female,1,Central,N,1,145.0,0.247036
3,Susan,3.2,64445,27,Male,0,Central,Y,3,75.1,0.270553
4,Matthew,8.2,113812,47,Female,1,Central,N,2,131.1,0.248449


In [29]:
## Determine the index locations for the rows with zscores that are greater than "3"
z_outliers = dfz.loc[dfz["zscore_Salary"] > 3].index

## Preview list of index values
print(z_outliers)

Int64Index([1, 28], dtype='int64')


In [30]:
## What information can we find at these index locations?
## .loc selects rows by index label, which is what z_outliers contains

dfz.loc[z_outliers]

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore,zscore_Salary
1,Nick,5.9,8136300,39,Female,0,Central,Y,3,105.7,3.343557
28,Bill,7.1,9827300,36,Female,0,East,Y,3,116.8,4.100689


In [31]:
## Drop rows with above index values
dfz = dfz.drop(z_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(dfz.shape)

(28, 11)


In [75]:
dfz.head()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore,zscore_Salary
0,Richard,5.1,66029,41,Female,0,Central,Y,2,98.3,0.269844
2,Morgan,9.5,116969,50,Female,1,Central,N,1,145.0,0.247036
3,Susan,3.2,64445,27,Male,0,Central,Y,3,75.1,0.270553
4,Matthew,8.2,113812,47,Female,1,Central,N,2,131.1,0.248449
5,Richard,10.3,160,47,Male,1,Central,N,3,152.1,0.299336


In [73]:
dfz.tail()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore,zscore_Salary
24,Matthew,6.0,93940,33,Male,0,Central,Y,1,104.9,0.257347
25,Smith,9.0,105582,28,Female,0,Central,N,3,133.4,0.252134
26,Alex,2.0,43525,34,Male,0,East,Y,3,65.2,0.27992
27,Richard,4.9,67938,40,Female,0,East,Y,3,96.0,0.268989
29,James,4.0,56957,45,Male,1,West,N,2,88.5,0.273906


## Method 2 : Using the Interquartile Range (IQR) to detect and remove outliers

The <font color=salmon><b>Interquartile Range (IQR)</b></font> is the difference between the third and first quartiles in a set of data. The IQR spans the middle 50% of the data values, where the bulk of the data rests. The IQR is calculated by the following equation: <B>IQR = Q3 - Q1</B>
<img src="IQR.png">
_____________________

The IQR can be used to create "fences", or cut-off values, that determine which values fall outside of the acceptable data range. For example, we can calculate a lower limit, and all values falling below this limit will be considered outliers. The <b>1.5 Rule</b> is a commonly used method for determining the upper and lower cut-off limits. According to this rule, a value is considered an outlier if it falls:

* ... below Q1 - (1.5 x IQR)

* ... above Q3 + (1.5 x IQR)
_____________________

#### 1.5 Rule Example:

A professor is tracking the number of absences for each of their students. They have compiled a list of the number of absences: 

    1, 3, [4], 6, 7, [7], 8, 8, [9], 22, 37 

In this list of data, we can find the following values:

* Q1 = 4
* Q2 (median) = 7
* Q3 = 9

With these values, you can easily calculate the IQR:

* <B>IQR</B> = Q3(9) - Q1(4)
* <B>IQR</B> = 5

Once you have the IQR, you can calculate the upper and lower limits for detecting outliers:

* Upper Limit = Q3(9) + [1.5 * IQR(5)] = <b> 16.5 </b>
* Lower Limit = Q1(4) - [1.5 * IQR(5)] = <b> -3.5 </b>

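Values that fall above the upper limit are outliers; values that fall below the lower limit are outliers. In this list, 22 and 37 are above the upper limit of 16.5, so both are outliers; no value falls below the lower limit of -3.5.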
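
The worked example can be reproduced in a few lines. This sketch finds the quartiles the same way the example does, by taking the median of the lower half and the median of the upper half of the sorted list; it uses Python's built-in `statistics` module, and the variable names are illustrative.

```python
import statistics

absences = [1, 3, 4, 6, 7, 7, 8, 8, 9, 22, 37]   # already sorted, 11 values

middle = len(absences) // 2                       # position of the median (index 5)
q1 = statistics.median(absences[:middle])         # median of the lower half
q2 = statistics.median(absences)                  # median of the whole list
q3 = statistics.median(absences[middle + 1:])     # median of the upper half

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in absences if x < lower_limit or x > upper_limit]

print("Q1, Q2, Q3:", q1, q2, q3)
print("Limits:", lower_limit, upper_limit)
print("Outliers:", outliers)
```

Quartile conventions differ between tools. The pandas `quantile()` method interpolates linearly by default and gives Q1 = 5 and Q3 = 8.5 for this list, so its fences are slightly different, although 22 and 37 are still the only values flagged.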
_______________________ 

#### Steps for Detecting and Removing Outliers with IQR

* Calculate quartiles
* Calculate IQR
* Determine upper and lower limits
* Drop rows with values below the lower limit or above the upper limit
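
When the same steps have to be repeated for several columns, it can help to wrap them in a function. The helper below, `drop_iqr_outliers`, is a hypothetical name introduced for this sketch and is not part of pandas; the `example` data is also made up.

```python
import pandas as pd


def drop_iqr_outliers(data: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of `data` without rows outside the IQR fences for `column`."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # between() is inclusive, so values exactly on a fence are kept
    keep = data[column].between(lower, upper)
    return data.loc[keep].copy()


# Hypothetical data: nine typical salaries (in thousands) and one extreme value
example = pd.DataFrame({"Salary": [48, 50, 52, 55, 58, 60, 63, 65, 70, 900]})
cleaned = drop_iqr_outliers(example, "Salary")

print(len(example) - len(cleaned), "row(s) dropped")
```

Filtering with a boolean mask avoids looking rows up by index afterwards, so the function behaves the same whether or not the index has gaps from earlier row removals.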

In [33]:
## Create a copy of original dataset
dfq = df.copy()

## Check original shape of the dataset
dfq.shape

(30, 10)

In [34]:
## Calculate quartiles
q1 = dfq["Salary"].quantile(.25)
q3 = dfq["Salary"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 48265.0
Q3: 104512.0


In [35]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 56247.0


In [36]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 188882.5
Lower Limit: -36105.5


In [37]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = dfq.loc[(dfq['Salary'] > top) | (dfq['Salary'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([1, 12, 28], dtype='int64')


In [38]:
## what values can we find at these index locations?
## .loc selects rows by index label, which is what iqr_outliers contains

dfq.loc[iqr_outliers]

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore
1,Nick,5.9,8136300,39,Female,0,Central,Y,3,105.7
12,Thomas,1.5,377310,46,Male,1,Central,Y,1,63.8
28,Bill,7.1,9827300,36,Female,0,East,Y,3,116.8


In [39]:
## Drop rows with above index values
dfq = dfq.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(dfq.shape)

(27, 10)


In [40]:
dfq.head()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore
0,Richard,5.1,66029,41,Female,0,Central,Y,2,98.3
2,Morgan,9.5,116969,50,Female,1,Central,N,1,145.0
3,Susan,3.2,64445,27,Male,0,Central,Y,3,75.1
4,Matthew,8.2,113812,47,Female,1,Central,N,2,131.1
5,Richard,10.3,160,47,Male,1,Central,N,3,152.1


In [76]:
dfq.tail()

Unnamed: 0,Employee,YearsExperience,Salary,Age,Gender,RemoteStatus,Region,CompanyCar,Shift,PerformanceReviewScore
24,Matthew,6.0,93940,33,Male,0,Central,Y,1,104.9
25,Smith,9.0,105582,28,Female,0,Central,N,3,133.4
26,Alex,2.0,43525,34,Male,0,East,Y,3,65.2
27,Richard,4.9,67938,40,Female,0,East,Y,3,96.0
29,James,4.0,56957,45,Male,1,West,N,2,88.5


# { Module 10 Homework }

1. Import the "babies.xlsx" dataset. See below for information on the columns:

    * bwt - birth weight of newborn baby
    * gestation - gestation length (days)
    * parity - previously pregnant (0 = no; 1 = yes)
    * age - age of mother
    * height - height of mother (inches)	
    * weight - weight of mother (pounds)
    * smoke - smoking status of mother (0 = nonsmoker; 1 = smoker)

In [87]:
df = pd.read_excel("babies.xlsx")

df

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,77.0,1.0
3,123,,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0
...,...,...,...,...,...,...,...
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,67.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0


2. Preview the first few rows of the dataset. Complete the following checks for your dataset:

    * output the summary information for the dataset, what types of variables make up this dataset?
    * is there any missing data in this dataset?

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1236 entries, 0 to 1235
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   bwt        1236 non-null   int64  
 1   gestation  1223 non-null   float64
 2   parity     1236 non-null   int64  
 3   age        1234 non-null   float64
 4   height     1214 non-null   float64
 5   weight     1200 non-null   float64
 6   smoke      1226 non-null   float64
dtypes: float64(5), int64(2)
memory usage: 67.7 KB


3. Handle the missing data in the dataset -- there isn't much, so we can drop all the rows that have at least one missing value. How many rows were dropped from the dataset?

In [89]:
df.isnull().sum()

bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

In [90]:
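## Drop every row that has at least one missing value
## (1236 rows before; the describe() output further down shows 1174 remain, so 62 rows were dropped)
df.dropna(inplace = True)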

df.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,77.0,1.0
4,108,282.0,0,23.0,67.0,125.0,1.0
5,136,286.0,0,25.0,62.0,93.0,0.0


In [91]:
df.isnull().sum()

bwt          0
gestation    0
parity       0
age          0
height       0
weight       0
smoke        0
dtype: int64

4. There are two qualitative variables in this dataset. What are they? How do you know they are qualitative variables?

In [92]:
df.describe()
## The two qualitative variables are parity and smoke: their only values are 0 and 1,
## which are codes for no/yes categories rather than numeric measurements

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
count,1174.0,1174.0,1174.0,1174.0,1174.0,1174.0,1174.0
mean,119.462521,279.101363,0.262351,27.228279,64.049404,131.326235,0.390971
std,18.328671,16.010305,0.4401,5.817839,2.526102,54.103828,0.488176
min,55.0,148.0,0.0,15.0,53.0,27.0,0.0
25%,108.0,272.0,0.0,23.0,62.0,114.0,0.0
50%,120.0,280.0,0.0,26.0,64.0,125.0,0.0
75%,131.0,288.0,1.0,31.0,66.0,139.75,1.0
max,176.0,353.0,1.0,45.0,72.0,1500.0,1.0


5. Replace the values in the qualitative variables with meaningful labels that describe the different groups. Use any method you've learned in previous modules to complete this task. 

In [93]:
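## Assign the result back to the column; a chained call such as df["col"].replace(..., inplace=True)
## raises a FutureWarning in pandas 2.x and may not modify df in newer pandas versions
df["parity"] = df["parity"].replace({0: "No", 1: "Yes"})
df["smoke"] = df["smoke"].replace({0: "No", 1: "Yes"})
df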

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,No,27.0,62.0,100.0,No
1,113,282.0,No,33.0,64.0,135.0,No
2,128,279.0,No,28.0,64.0,77.0,Yes
4,108,282.0,No,23.0,67.0,125.0,Yes
5,136,286.0,No,25.0,62.0,93.0,No
...,...,...,...,...,...,...,...
1231,113,275.0,Yes,27.0,60.0,100.0,No
1232,128,265.0,No,24.0,67.0,120.0,No
1233,130,291.0,No,30.0,65.0,67.0,Yes
1234,125,281.0,Yes,21.0,65.0,110.0,No


6. Before we move forward with any statistics, let's identify and remove any outliers from the dataset. Using the IQR method, search for outliers in the 5 numeric variables. This will take some time and organization, so be careful with your code! Make sure you keep track of how many rows/outliers are removed. 

In [136]:
df2 = df.copy()

In [137]:
df2.shape

(1174, 7)

In [139]:
## Calculate quartiles
q1 = df2["bwt"].quantile(.25)
q3 = df2["bwt"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 108.0
Q3: 131.0


In [140]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 23.0


In [141]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 165.5
Lower Limit: 73.5


In [142]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['bwt'] > top) | (df2['bwt'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([ 239,  309,  361,  462,  500,  529,  556,  594,  632,  709,  738,
             747,  829,  904,  912,  978, 1021, 1035, 1063, 1065, 1099, 1139,
            1148, 1169],
           dtype='int64')


In [110]:
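## what values can we find at these index locations?
## NOTE: .iloc selects by position, but iqr_outliers holds index labels. After dropna() the labels
## no longer match row positions, so most of the rows shown below are not the flagged outliers
## (for example, a bwt of 103 is inside the fences). The index is reset in the next cell to fix this.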

df2.iloc[[239,  309,  361,  462,  500,  529,  556,  594,  632,  709,  738,
             747,  829,  904,  912,  978, 1021, 1035, 1063, 1065, 1099, 1139,
            1148, 1169]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
261,81,256.0,No,30.0,64.0,148.0,Yes
332,103,273.0,Yes,22.0,64.0,110.0,Yes
388,98,275.0,No,25.0,65.0,112.0,Yes
497,101,289.0,Yes,31.0,60.0,125.0,No
537,147,277.0,No,30.0,68.0,160.0,No
566,133,280.0,Yes,25.0,61.0,130.0,No
593,118,297.0,No,35.0,68.0,140.0,Yes
632,176,293.0,Yes,19.0,68.0,180.0,No
676,100,275.0,No,26.0,60.0,115.0,No
757,155,279.0,No,33.0,61.0,125.0,No
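
The mismatch in the table above comes from mixing label-based and position-based indexing. The sketch below reproduces the effect on a small made-up frame (`toy` and its values are hypothetical) whose index has gaps, as happens after rows have been dropped.

```python
import pandas as pd

# Toy frame whose index labels have gaps
toy = pd.DataFrame({"bwt": [120, 60, 180, 113, 125]}, index=[0, 2, 3, 5, 6])

# Labels of the rows outside the bwt fences used above
labels = toy.loc[(toy["bwt"] > 165.5) | (toy["bwt"] < 73.5)].index
print(list(labels))                      # [2, 3]

by_label = toy.loc[labels]               # label-based lookup: the outliers
by_position = toy.iloc[list(labels)]     # position-based lookup: different rows

print(by_label["bwt"].tolist())          # [60, 180]
print(by_position["bwt"].tolist())       # [180, 113]
```

Using `.loc` with the labels avoids the problem without resetting the index; resetting the index first, as the notebook does next, makes labels and positions coincide so that `.iloc` also works.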


In [143]:
df2 = df2.reset_index(drop = True)

In [144]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['bwt'] > top) | (df2['bwt'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([ 220,  287,  335,  430,  465,  492,  519,  557,  594,  663,  692,
             700,  780,  851,  859,  922,  964,  978, 1005, 1007, 1041, 1081,
            1090, 1111],
           dtype='int64')


In [145]:
## what values can we find at these index locations?

df2.iloc[[220,  287,  335,  430,  465,  492,  519,  557,  594,  663,  692,
             700,  780,  851,  859,  922,  964,  978, 1005, 1007, 1041, 1081,
            1090, 1111]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
220,173,293.0,No,30.0,63.0,110.0,No
287,71,281.0,No,32.0,60.0,117.0,Yes
335,71,234.0,No,32.0,64.0,110.0,Yes
430,68,223.0,No,32.0,66.0,149.0,Yes
465,69,232.0,No,31.0,59.0,103.0,Yes
492,71,277.0,No,40.0,69.0,135.0,No
519,174,281.0,No,37.0,67.0,155.0,No
557,170,303.0,Yes,21.0,64.0,129.0,No
594,176,293.0,Yes,19.0,68.0,180.0,No
663,166,299.0,No,26.0,68.0,140.0,No


In [146]:
## Drop rows with above index values
df2 = df2.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(df2.shape)

(1150, 7)


In [147]:
## Calculate quartiles
q1 = df2["gestation"].quantile(.25)
q3 = df2["gestation"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 272.0
Q3: 288.0


In [148]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 16.0


In [149]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 312.0
Lower Limit: 248.0


In [153]:
df2 = df2.reset_index(drop = True)

In [154]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['gestation'] > top) | (df2['gestation'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([   5,    6,    9,   56,   60,   63,  109,  119,  143,  173,  177,
             182,  193,  198,  200,  215,  220,  232,  237,  256,  319,  343,
             364,  406,  425,  445,  470,  482,  584,  632,  645,  654,  670,
             688,  701,  708,  717,  723,  767,  768,  771,  806,  858,  883,
             899,  930,  947,  952,  996, 1055, 1060, 1062, 1066, 1071, 1090,
            1095, 1114, 1121, 1122, 1131, 1133, 1140],
           dtype='int64')


In [156]:
## what values can we find at these index locations?

df2.iloc[[5,    6,    9,   56,   60,   63,  109,  119,  143,  173,  177,
             182,  193,  198,  200,  215,  220,  232,  237,  256,  319,  343,
             364,  406,  425,  445,  470,  482,  584,  632,  645,  654,  670,
             688,  701,  708,  717,  723,  767,  768,  771,  806,  858,  883,
             899,  930,  947,  952,  996, 1055, 1060, 1062, 1066, 1071, 1090,
            1095, 1114, 1121, 1122, 1131, 1133, 1140]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
5,138,244.0,No,33.0,62.0,178.0,No
6,132,245.0,No,23.0,65.0,140.0,No
9,140,351.0,No,27.0,68.0,120.0,No
56,75,232.0,No,33.0,61.0,110.0,No
...,...,...,...,...,...,...,...
1121,127,242.0,No,17.0,61.0,135.0,Yes
1122,87,247.0,Yes,18.0,66.0,125.0,Yes
1131,146,319.0,No,28.0,66.0,145.0,No
1133,110,321.0,No,28.0,66.0,180.0,No


In [157]:
## Drop rows with above index values
df2 = df2.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(df2.shape)

(1088, 7)


In [158]:
## Calculate quartiles
q1 = df2["age"].quantile(.25)
q3 = df2["age"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 23.0
Q3: 31.0


In [159]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 8.0


In [160]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 43.0
Lower Limit: 11.0


In [161]:
df2 = df2.reset_index(drop = True)

In [162]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['age'] > top) | (df2['age'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([557, 944], dtype='int64')


In [163]:
## what values can we find at these index locations?

df2.iloc[[557, 944]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
557,143,294.0,No,44.0,65.0,145.0,No
944,122,280.0,Yes,45.0,62.0,128.0,No


In [164]:
## Drop rows with above index values
df2 = df2.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(df2.shape)

(1086, 7)


In [165]:
## Calculate quartiles
q1 = df2["height"].quantile(.25)
q3 = df2["height"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 62.0
Q3: 66.0


In [166]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 4.0


In [167]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 72.0
Lower Limit: 56.0


In [168]:
df2 = df2.reset_index(drop = True)

In [169]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['height'] > top) | (df2['height'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([378, 1062], dtype='int64')


In [170]:
## what values can we find at these index locations?

df2.iloc[[378, 1062]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
378,146,263.0,No,39.0,53.0,110.0,Yes
1062,141,281.0,No,29.0,54.0,156.0,Yes


In [171]:
## Drop rows with above index values
df2 = df2.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(df2.shape)

(1084, 7)


In [172]:
## Calculate quartiles
q1 = df2["weight"].quantile(.25)
q3 = df2["weight"].quantile(.75)

print("Q1:", q1)
print("Q3:", q3)

Q1: 114.0
Q3: 137.0


In [173]:
## Calculate the IQR
iqr = q3 - q1

print("IQR:", iqr)

IQR: 23.0


In [174]:
## Determine outlier fences 
top = q3 + (iqr * 1.5)
bottom = q1 - (iqr * 1.5)


print("Upper Limit:", top)
print("Lower Limit:", bottom)

Upper Limit: 171.5
Lower Limit: 79.5


In [175]:
df2 = df2.reset_index(drop = True)

In [176]:
## Determine the index locations for rows that fall outside of outlier fences

iqr_outliers = df2.loc[(df2['weight'] > top) | (df2['weight'] < bottom)].index

print("INDEX VALUES:", iqr_outliers)

INDEX VALUES: Int64Index([   2,   10,   17,   19,   23,   36,   78,  101,  108,  129,  131,
             140,  157,  158,  160,  189,  244,  266,  269,  298,  357,  372,
             398,  441,  443,  453,  457,  490,  521,  533,  547,  610,  611,
             633,  642,  743,  752,  753,  759,  778,  800,  811,  821,  888,
            1016, 1029, 1064, 1081],
           dtype='int64')


In [177]:
## what values can we find at these index locations?

df2.iloc[[2,   10,   17,   19,   23,   36,   78,  101,  108,  129,  131,
             140,  157,  158,  160,  189,  244,  266,  269,  298,  357,  372,
             398,  441,  443,  453,  457,  490,  521,  533,  547,  610,  611,
             633,  642,  743,  752,  753,  759,  778,  800,  811,  821,  888,
            1016, 1029, 1064, 1081]]

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
2,128,279.0,No,28.0,64.0,77.0,Yes
10,114,273.0,No,30.0,63.0,498.0,No
17,115,274.0,No,27.0,67.0,175.0,Yes
19,122,276.0,No,30.0,68.0,182.0,No
23,114,266.0,No,20.0,65.0,175.0,Yes
36,110,278.0,No,23.0,63.0,177.0,No
78,125,305.0,No,22.0,70.0,196.0,Yes
101,131,283.0,No,25.0,67.0,215.0,No
108,115,283.0,No,25.0,61.0,1500.0,Yes
129,160,300.0,No,29.0,71.0,175.0,Yes


In [178]:
## Drop rows with above index values
df2 = df2.drop(iqr_outliers)

## Re-check the shape of the dataframe, how many rows were dropped?
print(df2.shape)

(1036, 7)
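
The five passes above repeat the same steps column by column. A minimal sketch of the same sequential procedure as a loop is shown below, run on a small made-up frame. The helper name `drop_iqr_outliers` and the `toy` data are hypothetical. It filters with a boolean mask, so no `reset_index` or index bookkeeping is needed.

```python
import pandas as pd


def drop_iqr_outliers(frame, column, k=1.5):
    """Return the rows of `frame` whose `column` value lies within the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive, so values exactly on a fence are kept,
    # matching the strict > / < outlier test used in the cells above.
    # Caveat: rows with missing values in `column` are dropped by this mask.
    keep = frame[column].between(lower, upper)
    return frame[keep]


toy = pd.DataFrame({
    "bwt":    [120, 113, 128, 108, 136, 400, 125],
    "weight": [100, 135, 115, 125, 93, 110, 1500],
})

cleaned = toy.copy()
for col in ["bwt", "weight"]:
    before = len(cleaned)
    cleaned = drop_iqr_outliers(cleaned, col)
    print(col, "rows removed:", before - len(cleaned))

print(cleaned.shape)
```

As in the notebook, the quartiles for each column are recomputed on the frame left over from the previous pass, so the order of the columns can affect which rows are removed.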


In [179]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1036 entries, 0 to 1083
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   bwt        1036 non-null   int64  
 1   gestation  1036 non-null   float64
 2   parity     1036 non-null   object 
 3   age        1036 non-null   float64
 4   height     1036 non-null   float64
 5   weight     1036 non-null   float64
 6   smoke      1036 non-null   object 
dtypes: float64(4), int64(1), object(2)
memory usage: 64.8+ KB


In [180]:
df2.describe()

Unnamed: 0,bwt,gestation,age,height,weight
count,1036.0,1036.0,1036.0,1036.0,1036.0
mean,120.175676,279.984556,27.159266,64.009653,125.656371
std,16.404372,11.396404,5.723814,2.481215,16.581046
min,75.0,248.0,15.0,56.0,81.0
25%,109.75,273.0,23.0,62.0,113.0
50%,120.0,280.0,26.0,64.0,125.0
75%,131.0,288.0,31.0,66.0,135.0
max,165.0,312.0,43.0,72.0,171.0


7. Describe the characteristics of your qualitative variables by doing the following:

    * Determine the frequencies of each group within each categorical variable. Which groups have the highest frequency?
    * Determine the relative frequencies of each group within each categorical variable. Which groups have the highest relative frequencies?
    * Create a crosstab table using both the categorical variables in your dataset. How many mothers are smokers who have been pregnant before?

In [191]:
## Frequencies 

df2["bwt"].value_counts()

117    31
120    30
115    30
129    30
125    26
       ..
151     1
89      1
159     1
157     1
79      1
Name: bwt, Length: 89, dtype: int64

In [192]:
## Frequencies 

df2["gestation"].value_counts()

282.0    45
278.0    41
275.0    38
280.0    37
281.0    37
         ..
312.0     2
263.0     1
309.0     1
253.0     1
311.0     1
Name: gestation, Length: 64, dtype: int64

In [188]:
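## Frequencies 
## NOTE: this cell uses the original df (before outlier removal),
## so ages 44 and 45, which were dropped from df2, still appear below

df["age"].value_counts()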

23.0    90
24.0    83
26.0    82
27.0    80
22.0    77
25.0    75
21.0    65
28.0    64
29.0    61
30.0    61
20.0    57
19.0    49
33.0    42
31.0    41
32.0    36
34.0    31
35.0    29
37.0    27
36.0    24
39.0    22
38.0    18
18.0    15
41.0    14
40.0    11
17.0     7
43.0     6
42.0     4
15.0     1
44.0     1
45.0     1
Name: age, dtype: int64

In [193]:
## Frequencies 
### of the numeric columns, height has the highest single-value frequency (65.0 occurs 159 times)

df2["height"].value_counts()

65.0    159
64.0    157
63.0    141
66.0    132
62.0    116
61.0     87
67.0     85
60.0     53
68.0     40
59.0     22
69.0     18
70.0     12
58.0      8
71.0      3
56.0      1
72.0      1
57.0      1
Name: height, dtype: int64

In [194]:
## Frequencies 

df2["weight"].value_counts()

130.0    69
125.0    64
110.0    53
135.0    53
120.0    51
         ..
164.0     1
92.0      1
152.0     1
168.0     1
87.0      1
Name: weight, Length: 81, dtype: int64

In [195]:
## Relative Frequencies 

df2["bwt"].value_counts(normalize=True)

117    0.029923
120    0.028958
115    0.028958
129    0.028958
125    0.025097
         ...   
151    0.000965
89     0.000965
159    0.000965
157    0.000965
79     0.000965
Name: bwt, Length: 89, dtype: float64

In [196]:
## Relative Frequencies 
## NOTE: this cell uses the original df (before outlier removal), not df2

df["gestation"].value_counts(normalize=True)

282.0    0.040034
278.0    0.035775
277.0    0.034923
280.0    0.034072
281.0    0.034072
           ...   
233.0    0.000852
320.0    0.000852
239.0    0.000852
243.0    0.000852
321.0    0.000852
Name: gestation, Length: 104, dtype: float64

In [197]:
## Relative Frequencies 
## NOTE: this cell uses the original df (before outlier removal), not df2

df["age"].value_counts(normalize=True)

23.0    0.076661
24.0    0.070698
26.0    0.069847
27.0    0.068143
22.0    0.065588
25.0    0.063884
21.0    0.055366
28.0    0.054514
29.0    0.051959
30.0    0.051959
20.0    0.048552
19.0    0.041738
33.0    0.035775
31.0    0.034923
32.0    0.030664
34.0    0.026405
35.0    0.024702
37.0    0.022998
36.0    0.020443
39.0    0.018739
38.0    0.015332
18.0    0.012777
41.0    0.011925
40.0    0.009370
17.0    0.005963
43.0    0.005111
42.0    0.003407
15.0    0.000852
44.0    0.000852
45.0    0.000852
Name: age, dtype: float64

In [198]:
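## Relative Frequencies 
### of the numeric columns, height has the highest single-value relative frequency (64.0, about 15.2%)
## NOTE: this cell uses the original df (before outlier removal), not df2
df["height"].value_counts(normalize=True)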

64.0    0.152470
65.0    0.149915
63.0    0.137138
66.0    0.126065
62.0    0.108177
67.0    0.086882
61.0    0.084327
60.0    0.045997
68.0    0.044293
59.0    0.020443
69.0    0.016184
70.0    0.011073
58.0    0.008518
71.0    0.004259
53.0    0.000852
57.0    0.000852
56.0    0.000852
72.0    0.000852
54.0    0.000852
Name: height, dtype: float64

In [199]:
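## Relative Frequencies 
## NOTE: this cell uses the original df (before outlier removal), not df2,
## so extreme weights such as 845.0 still appear below

df["weight"].value_counts(normalize=True)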

130.0    0.064736
125.0    0.057070
110.0    0.051107
135.0    0.050256
120.0    0.045145
           ...   
845.0    0.000852
399.0    0.000852
197.0    0.000852
171.0    0.000852
67.0     0.000852
Name: weight, Length: 115, dtype: float64

In [201]:
### 110 mothers are smokers who have been pregnant before (parity = Yes, smoke = Yes)

pd.crosstab(df2["parity"], df2["smoke"], margins = True, normalize = False)

smoke,No,Yes,All
parity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
No,459,295,754
Yes,172,110,282
All,631,405,1036
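
Task 7 asks about the two categorical variables, `parity` and `smoke`. Their totals can be read off the `All` margins of the crosstab above (parity: 754 No, 282 Yes; smoke: 631 No, 405 Yes, i.e. roughly 72.8% and 60.9% for the larger groups), and the same frequencies can also be computed directly with `value_counts`. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "parity": ["No", "No", "Yes", "No", "Yes", "No"],
    "smoke":  ["No", "Yes", "Yes", "No", "No", "Yes"],
})

for col in ["parity", "smoke"]:
    counts = toy[col].value_counts()                  # frequencies
    shares = toy[col].value_counts(normalize=True)    # relative frequencies
    print(counts)
    print(shares)

parity_counts = toy["parity"].value_counts()
smoke_shares = toy["smoke"].value_counts(normalize=True)
```

On the real data the same two calls would be applied to `df2["parity"]` and `df2["smoke"]`.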


8. Describe the Measures of Central Tendency (mean, median, mode) of the numeric variables in the dataset. Calculate the mean, median, and mode for all 5 numeric variables. For each median and mode, describe the following:

    * What does the median value tell you about the data?
    * What does the mode value tell you about the data?

In [202]:
## Calculate the Mean of a column 

df2["bwt"].mean()

120.17567567567568

In [203]:
## Calculate the median of a column 
### the median is the middle value of the sorted column: half of the bwt values are at or below 120

df2["bwt"].median()

120.0

In [204]:
## Calculate the mode of a column 
### the mode is the most frequent value: a bwt of 117 occurs most often (31 times)
df2["bwt"].mode()

0    117
Name: bwt, dtype: int64

In [205]:
## Calculate the Mean of a column 

df2["height"].mean()

64.00965250965251

In [206]:
## Calculate the median of a column 
### the median is the middle value of the sorted column: half of the height values are at or below 64

df2["height"].median()

64.0

In [207]:
## Calculate the mode of a column 
### the mode is the most frequent value: a height of 65 occurs most often (159 times)
df2["height"].mode()

0    65.0
Name: height, dtype: float64
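
The cells above cover `bwt` and `height`; the task asks for all five numeric variables. Rather than repeating three cells per column, the mean and median can be requested for several columns at once with `DataFrame.agg`, and the modes with `DataFrame.mode`. A minimal sketch on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "bwt":    [120, 113, 128, 113, 136],
    "height": [62.0, 64.0, 64.0, 67.0, 62.0],
})

# One row per statistic, one column per variable
summary = toy.agg(["mean", "median"])
print(summary)

# mode() returns every tied value; shorter columns are padded with NaN
modes = toy.mode()
print(modes)
```

Here `height` has two modes (62.0 and 64.0), which is why `mode()` returns a table rather than a single number. On the real data the same calls would be applied to `df2[["bwt", "gestation", "age", "height", "weight"]]`.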

9. Describe the Measures of Variability of the numeric variables in the dataset. Calculate the range, standard deviation, and 85th percentile for all 5 numeric variables. Answer the following questions:

    * Which column has the largest range? Which has the smallest? What does this mean?
    * Which column has the largest standard deviation? Which has the smallest? What does this tell you about those variables?

In [208]:
## Calculate the range of values in a column

bwt_range = df2['bwt'].max() - df2['bwt'].min()

print(bwt_range)

90


In [212]:
## Calculate the range of values in a column

gestation_range = df2['gestation'].max() - df2['gestation'].min()

print(gestation_range)

64.0


In [213]:
## Calculate the range of values in a column

age_range = df2['age'].max() - df2['age'].min()

print(age_range)

28.0


In [209]:
## Calculate the range of values in a column
### height has the smallest range (16), so its values sit closest together

height_range = df2['height'].max() - df2['height'].min()

print(height_range)

16.0


In [214]:
## Calculate the range of values in a column
### weight and bwt share the largest range (90), meaning their values are the most widely spread

weight_range = df2['weight'].max() - df2['weight'].min()

print(weight_range)

90.0


In [215]:
## Calculate the Standard Deviation of values in a column

print("Mean bwt:", df2["bwt"].mean())

df2["bwt"].std()

Mean bwt: 120.17567567567568


16.40437160501985

In [216]:
## Calculate the Standard Deviation of values in a column

print("Mean gestation:", df2["gestation"].mean())

df2["gestation"].std()

Mean gestation: 279.984555984556


11.396403911826766

In [217]:
## Calculate the Standard Deviation of values in a column

print("Mean age:", df2["age"].mean())

df2["age"].std()

Mean age: 27.159266409266408


5.723814227977458

In [211]:
## Calculate the Standard Deviation of values in a column
### height has the smallest standard deviation: its values cluster tightly around the mean

print("Mean height:", df2["height"].mean())

df2["height"].std()

Mean height: 64.00965250965251


2.481215121299967

In [219]:
## Calculate the Standard Deviation of values in a column
### weight has the largest standard deviation
#### this tells us that weight values are widely dispersed around their mean, while height values are relatively concentrated
print("Mean weight:", df2["weight"].mean())

df2["weight"].std()

Mean weight: 125.65637065637065


16.581045809853826
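
Task 9 also asks for the 85th percentile, which `Series.quantile(0.85)` provides. The sketch below collects range, standard deviation, and 85th percentile into one table so the columns can be compared side by side. The `toy` frame and the column labels `range`, `std`, and `p85` are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    "bwt":    [100, 110, 120, 130, 140],
    "height": [60.0, 62.0, 64.0, 66.0, 68.0],
})

spread = pd.DataFrame({
    "range": toy.max() - toy.min(),   # largest value minus smallest value
    "std":   toy.std(),               # sample standard deviation (ddof=1)
    "p85":   toy.quantile(0.85),      # 85th percentile
})
print(spread)

# Which variable is most / least spread out?
print("Largest std:", spread["std"].idxmax())
print("Smallest std:", spread["std"].idxmin())
```

On the real data the same table would be built from `df2[["bwt", "gestation", "age", "height", "weight"]]`. Keep in mind that range and standard deviation are in each variable's own units, so comparisons across columns measured on different scales should be read with care.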