In [17]:
import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pathlib

# Section 1: Exploratory Data Analysis
#### In this section we will load and examine our datasets for analysis.

# Examining Our Files:
### In order to perform EDA, we need to load our datasets into the notebook. 
###### Once we have our files, we need to examine them and ask the following questions:
   * How much data do we have and how has that grown?
   * How is the data structured, formatted, and organized?
   * What fields do we have to analyze in each record?

##### We will use the helper functions below to inspect the files, then load the data into DataFrames to assist in EDA.

In [30]:
from pathlib import Path
import bz2
import os
import csv

In [37]:
def list_files(directory):
    """
    Return a list of pathlib.Path objects for the files in the directory.
    
    directory: a string describing the directory to list 
        for example 'data/'
    """
    p = Path(directory)

    # Keep regular files only; subdirectories (e.g. .ipynb_checkpoints) are skipped.
    # Slicing off the first entry is unreliable because iterdir() order is arbitrary.
    return [child for child in p.iterdir() if child.is_file()]
        
    
def get_file_size(file_name):
    """
    Return file size for a given filename.
    """ 
    
    p = Path(file_name)
    return p.stat().st_size
    
    

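def get_linecount_csv(file_name):
    """
    Returns the number of rows (records) in a csv file, including the header row.
    """
    # newline="" lets the csv module handle line endings inside quoted fields.
    # The with-block closes the file, and rows are counted without building a list.
    with open(file_name, newline="") as file:
        return sum(1 for _ in csv.reader(file))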
    

### To guide our interpretation of incoming data, we will create a DataFrame that shows the name, size, and line count of each updated file.

In [38]:
info = []
for f in list_files("data/"):
    name = str(f)
    
    if f.suffix == ".csv":
        
        size = get_file_size(f)
        linecount = get_linecount_csv(f)
        info.append({"name": name, "size": size, "linecount": linecount})

file_info = pd.DataFrame(info).sort_values("size")
file_info

                                        name     size  linecount
0                        data/05-05-2020.csv     9348         59
2     data/time_series_covid19_deaths_US.csv  1039226       3262
1  data/time_series_covid19_confirmed_US.csv  1093856       3262


### Reading in the files:
##### Now that we have an idea of the new data, we will load it in for further EDA

### **Make sure to update the date in the US_daily_report file name to yesterday's date**
##### If the file name does not match a report that exists in data/, `pd.read_csv` will not be able to read it.

In [40]:
US_daily_report = pd.read_csv("data/05-05-2020.csv")
US_confirmed_deaths = pd.read_csv("data/time_series_covid19_deaths_US.csv")
US_confirmed_cases = pd.read_csv("data/time_series_covid19_confirmed_US.csv")
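
One way to avoid editing the file name by hand is to derive it from yesterday's date. The following is a minimal sketch, assuming the daily reports are always named `MM-DD-YYYY.csv` as in the cell above; `daily_report_path` is a hypothetical helper, not part of the notebook.

```python
import datetime
from pathlib import Path


def daily_report_path(data_dir, today=None):
    """Build the path of yesterday's daily report inside data_dir."""
    if today is None:
        today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    # Assumption: daily reports are named MM-DD-YYYY.csv.
    return Path(data_dir) / yesterday.strftime("%m-%d-%Y.csv")


# Fixed date used here only so the example is reproducible.
report_path = daily_report_path("data", today=datetime.date(2020, 5, 6))
print(report_path.name)  # 05-05-2020.csv
```

Checking `report_path.exists()` before calling `pd.read_csv(report_path)` would give a clearer message when yesterday's report has not been downloaded yet.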


# Identifying Issues with our Data
### For each dataset, we will inspect the loaded data and identify the issues that need to be addressed during cleaning
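
A first pass at spotting issues could look like the sketch below. It reports the dtype, missing-value count, and number of distinct values per column, plus the number of fully duplicated rows. `summarize_issues` is a hypothetical helper, and the toy frame uses made-up column names and values purely to show the shape of the output; the same call could be applied to each loaded DataFrame.

```python
import pandas as pd


def summarize_issues(df):
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # missing values are not counted as a value
    })


# Toy frame with made-up values (not taken from the real files).
toy = pd.DataFrame({
    "region": ["A", "B", "B", None],
    "cases": [10, 20, 20, 5],
})

print(summarize_issues(toy))
print("duplicate rows:", int(toy.duplicated().sum()))
```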

## Cleaning Data:
#### We will use the issues we identified with our data to clean it


## Joining Tables: (transformation for EDA process)
#### To build a predictive analysis on our data, we will need to manipulate the tables to provide useful information
###### Tables we are joining and why:
   * Table One
   * Table Two
   * Etc...
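
The tables to join are not decided yet, so the following is only an illustration of the general pattern: reshape two wide tables (one column per date) into long form, then merge them on a shared key and date. The column names `key`, `cases`, and `deaths`, the date format, and all values are assumptions made for this sketch.

```python
import pandas as pd


def to_long(wide, id_cols, value_name):
    """Reshape a wide table (one column per date) into one row per key and date."""
    long_df = wide.melt(id_vars=id_cols, var_name="date", value_name=value_name)
    # Assumption: date columns are labelled like 5/4/20 (month/day/two-digit year).
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    return long_df


# Toy wide tables with made-up values.
cases_wide = pd.DataFrame({"key": ["A", "B"], "5/4/20": [1, 3], "5/5/20": [2, 5]})
deaths_wide = pd.DataFrame({"key": ["A", "B"], "5/4/20": [0, 1], "5/5/20": [0, 2]})

cases_long = to_long(cases_wide, ["key"], "cases")
deaths_long = to_long(deaths_wide, ["key"], "deaths")

# validate="one_to_one" raises if a (key, date) pair is duplicated on either side.
joined = cases_long.merge(
    deaths_long, on=["key", "date"], how="inner", validate="one_to_one"
)
print(joined)
```

An inner join keeps only the key and date pairs present in both tables; `how="outer"` with `indicator=True` is an alternative when unmatched rows need to be inspected.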
   

## EDA Visualization

#### In this section we will finally make observations about our transformed data and build visualizations to show them.

#### **NOTE: THIS SECTION SHOULD BE LONG. IT SERVES TO VISUALIZE SOME OF THE FEATURES WE USE FOR OUR MODEL.**

## Training Validation Split
##### Split Data for model selection
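
A simple random holdout split could be sketched as follows. `train_validation_split`, the 80/20 ratio, and the toy frame are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd


def train_validation_split(df, validation_fraction=0.2, seed=0):
    """Randomly split rows into disjoint training and validation frames."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))  # shuffled row positions
    n_val = int(round(len(df) * validation_fraction))
    val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]
    return df.iloc[train_idx], df.iloc[val_idx]


# Toy frame with made-up values.
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})
train, val = train_validation_split(toy, validation_fraction=0.2, seed=42)
print(len(train), len(val))  # 8 2
```

For data ordered in time, a chronological split (train on earlier dates, validate on later ones) may be a better fit than a random one, since it avoids validating on dates that precede the training data.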

# Model Selection: 
### What models are we going to use & Why?