# ME 364 Midterm Project - Part 2

## 1. Setup and Data Cleaning

### DataFrame Setup
Import the libraries and the dataset, create the DataFrame, and display
the first few rows to confirm that the import succeeded.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/yairg98/Data-Driven-Problem-Solving/main/Midterm%20Project/Part%202/Manufacturing_Industry_Database.csv'
df = pd.read_csv(url)

df.head()

Unnamed: 0,naics,naics_title,year,emp,pay,prode,prodh,prodw,vship,matcost,...,equip,plant,piship,pimat,piinv,pien,dtfp5,tfp5,dtfp4,tfp4
0,311111,Dog and Cat Food Manufacturing,1958,18.0,81.3,12.0,25.7,49.8,1042.4,752.4,...,517.6,695.0,0.354,0.133,0.168,0.116,,0.317,,0.318
1,311111,Dog and Cat Food Manufacturing,1959,17.9,82.5,11.8,25.5,49.4,1051.0,758.9,...,573.7,725.3,0.345,0.131,0.169,0.115,0.002,0.318,0.002,0.318
2,311111,Dog and Cat Food Manufacturing,1960,17.7,84.8,11.7,25.4,50.0,1050.2,752.8,...,611.6,739.2,0.274,0.129,0.173,0.119,0.217,0.394,0.216,0.395
3,311111,Dog and Cat Food Manufacturing,1961,17.5,87.4,11.5,25.4,51.4,1119.7,803.6,...,642.6,752.0,0.273,0.131,0.172,0.117,0.024,0.404,0.024,0.405
4,311111,Dog and Cat Food Manufacturing,1962,17.6,90.2,11.5,25.2,52.1,1175.7,853.3,...,669.1,768.7,0.277,0.132,0.174,0.119,-0.007,0.401,-0.007,0.402


### Missing Data Count

Identify the total number of entries and evaluate missing data by both
the number and the percentage of values missing from each column.

In the table below, several groups of columns have exactly the same
number of missing values. The columns in each such group are therefore
assumed to go together (either all or none are missing in any given row).

Also, to account for the possibility that strings have been used instead
of nulls to represent missing numeric values, the table includes the
data type of each column. As shown below, these look correct.

In [33]:
# Number of rows in dataset
rows = len(df.index)
print("Number of entries in dataset: {}".format(rows))

# Number of missing values in dataset
missing = df.isnull().sum()
print("Total number of values missing: {}".format(missing.sum()))
print("Number of missing values in each column:")

df_specs = pd.DataFrame(
    {"num_missing": missing,
    "percent_missing": missing.divide(rows/100),
    "dtype": df.dtypes})
    
df_specs.sort_values(by='percent_missing', ascending=False)

Number of entries in dataset: 22204
Total number of values missing: 9490
Number of missing values in each column:


column,num_missing,percent_missing,dtype
dtfp4,1209,5.444965,float64
dtfp5,1209,5.444965,float64
tfp4,845,3.805621,float64
tfp5,845,3.805621,float64
piinv,845,3.805621,float64
plant,845,3.805621,float64
equip,845,3.805621,float64
cap,845,3.805621,float64
invest,481,2.166276,float64
energy,117,0.526932,float64
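
The `dtype` column catches string-coded missing values only indirectly: a numeric column containing a placeholder such as `"-"` would be read as text rather than `float64`. A more direct check could look like the minimal sketch below. The helper name `find_string_coded_missing` and the `toy` frame are illustrative assumptions, not part of the notebook or the dataset.

```python
import pandas as pd

def find_string_coded_missing(frame, text_columns=()):
    """Return {column: [values that are not parseable as numbers]}.

    Hypothetical helper. Columns that are expected to hold text
    (e.g. titles) are skipped via `text_columns`.
    """
    suspects = {}
    for col in frame.columns:
        # Numeric columns cannot contain string placeholders
        if col in text_columns or pd.api.types.is_numeric_dtype(frame[col]):
            continue
        values = frame[col].dropna()
        # Anything that cannot be parsed as a number becomes NaN
        parsed = pd.to_numeric(values, errors="coerce")
        bad = sorted(set(values[parsed.isna()]))
        if bad:
            suspects[col] = bad
    return suspects

# Toy data (illustrative only): 'emp' hides a missing value as the string "-"
toy = pd.DataFrame({
    "naics_title": ["Dog and Cat Food Manufacturing"] * 3,
    "emp": ["18.0", "-", "17.7"],
    "pay": [81.3, 82.5, 84.8],
})

suspects = find_string_coded_missing(toy, text_columns=["naics_title"])
print(suspects)
```

On the real data the call would be `find_string_coded_missing(df, text_columns=["naics_title"])`; an empty result would support the conclusion that no numeric column uses string placeholders.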


### Missing Data Analysis & Cleaning

Because the table above shows that columns are missing data in groups,
further analysis can be limited to one column from each group, with any
trends or patterns assumed to carry over to the other members of that
group.

- Group A: dtfp4, dtfp5
- Group B: tfp4, tfp5, piinv, plant, equip, cap
- Group C: invest
- Group D: energy, pien, pimat, piship, invent, vadd, matcost, vship,
    prodw, prodh, prode, pay, emp
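
Equal missing counts do not by themselves guarantee that the same rows are missing in each column. One way the grouping could be verified is to group columns whose missing-value masks match row for row, as in the sketch below. The helper name `group_columns_by_missing_pattern` and the toy values are illustrative assumptions; on the real data the call would be `group_columns_by_missing_pattern(df)`.

```python
import numpy as np
import pandas as pd

def group_columns_by_missing_pattern(frame):
    """Group columns whose missing-value masks are identical row for row."""
    mask = frame.isnull()
    groups = {}
    for col in mask.columns:
        if not mask[col].any():
            continue  # skip columns with no missing values
        # The tuple of booleans identifies exactly which rows are missing
        key = tuple(mask[col])
        groups.setdefault(key, []).append(col)
    return list(groups.values())

# Toy data (illustrative only)
toy = pd.DataFrame({
    "year": [2016, 2017, 2018],
    "tfp4": [0.9, np.nan, np.nan],
    "tfp5": [1.0, np.nan, np.nan],
    "invest": [5.0, 6.0, np.nan],
})

pattern_groups = group_columns_by_missing_pattern(toy)
print(pattern_groups)
```

If the groups returned for the full dataset match groups A through D, the "none or all missing" assumption holds exactly; if a group splits, its columns share a count but not a pattern.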

The table below reports, for each year, the fraction of values missing
from one representative column of each group. It shows that columns in
groups A and B are completely missing for the last two years in the
dataset, 2017 and 2018. Year 2018 is also missing all data from column
group C, and 1958 (the first year in the dataset) has no data from
column group A. Outside of those exceptions, no more than 0.825% of
values (a fraction of 0.008242 in the table) are missing for any given
column group in any given year. Therefore, years 2017 and 2018 will be
ignored when considering any features from groups A or B. Year 2018 will
also be ignored for any analysis involving column group C, and 1958 will
be ignored for any analysis involving column group A.
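
One way to apply these exclusions consistently is to record the excluded years per column group and filter the rows before each analysis. The sketch below is an assumption about how this could be organized, not code from the notebook: `EXCLUDED_YEARS`, `COLUMN_GROUP`, and `usable_rows` are hypothetical names, and the toy frame stands in for `df`.

```python
import pandas as pd

# Years to ignore for each column group, following the conclusions above
EXCLUDED_YEARS = {
    "A": {1958, 2017, 2018},
    "B": {2017, 2018},
    "C": {2018},
    "D": set(),
}

# Column-to-group lookup (hypothetical; only group representatives listed)
COLUMN_GROUP = {"dtfp4": "A", "tfp4": "B", "invest": "C", "energy": "D"}

def usable_rows(frame, columns):
    """Drop the years that must be ignored for any of the given columns."""
    excluded = set()
    for col in columns:
        excluded |= EXCLUDED_YEARS[COLUMN_GROUP[col]]
    return frame[~frame["year"].isin(list(excluded))]

# Toy data (illustrative only): one row per year
toy = pd.DataFrame({"year": [1958, 1959, 2016, 2017, 2018]})

print(usable_rows(toy, ["tfp4", "invest"])["year"].tolist())
print(usable_rows(toy, ["dtfp4"])["year"].tolist())
```

Taking the union of the excluded years means an analysis that mixes groups is restricted to the years in which every involved column is populated.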

In [53]:
# Analyzing the missing data by year
is_missing = df.isnull()
total_entries_by_year = df['year'].value_counts()
num_missing_by_year = is_missing.groupby(df['year']).sum()
# Despite the name, these values are fractions in [0, 1], not percentages
percent_missing_by_year = is_missing.groupby(df['year']).mean()

# List of group representatives (one column from each column-group)
group_reps = ['dtfp4', 'tfp4', 'invest', 'energy']

# Fraction of missing values by year, relabeled by column group
percent_missing_by_year = percent_missing_by_year[group_reps].rename(
    columns = {'dtfp4': 'A', 'tfp4': 'B', 'invest': 'C', 'energy': 'D'})

print(percent_missing_by_year.to_string())

             A         B         C         D
year                                        
1958  1.000000  0.008242  0.008242  0.008242
1959  0.008242  0.008242  0.008242  0.008242
1960  0.008242  0.008242  0.008242  0.008242
1961  0.008242  0.008242  0.008242  0.008242
1962  0.008242  0.008242  0.008242  0.008242
1963  0.008242  0.008242  0.008242  0.008242
1964  0.008242  0.008242  0.008242  0.008242
1965  0.008242  0.008242  0.008242  0.008242
1966  0.008242  0.008242  0.008242  0.008242
1967  0.008242  0.008242  0.008242  0.008242
1968  0.008242  0.008242  0.008242  0.008242
1969  0.008242  0.008242  0.008242  0.008242
1970  0.008242  0.008242  0.008242  0.008242
1971  0.008242  0.008242  0.008242  0.008242
1972  0.008242  0.008242  0.008242  0.008242
1973  0.008242  0.008242  0.008242  0.008242
1974  0.008242  0.008242  0.008242  0.008242
1975  0.008242  0.008242  0.008242  0.008242
1976  0.008242  0.008242  0.008242  0.008242
1977  0.008242  0.008242  0.008242  0.008242
1978  0.00
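
The recurring fraction 0.008242 can be translated back into a row count. The sketch below assumes a balanced panel (the same number of rows in every year from 1958 through 2018), which the notebook does not state explicitly but which the reported total of 22204 rows is consistent with. The variable names are illustrative.

```python
total_rows = 22204           # "Number of entries in dataset" printed earlier
n_years = 2018 - 1958 + 1    # 1958 through 2018 inclusive

# Assumption: balanced panel, i.e. the same number of rows in every year
rows_per_year, remainder = divmod(total_rows, n_years)
print(rows_per_year, remainder)

# Convert the baseline fraction from the table into a number of rows
baseline_fraction = 0.008242
missing_rows = round(baseline_fraction * rows_per_year)
print(missing_rows, round(missing_rows / rows_per_year, 6))
```

Under that assumption, the baseline corresponds to a small, fixed number of rows per year rather than scattered gaps, which is why the same value repeats across years and column groups.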

## 2. Descriptive Statistics