# Week 6
This week we will work through the following:

* Data Loading - There are no problems in this section; just read the cells to see how to load data from different sources.
* EDA (Exploratory Data Analysis) - There are five problems in this section to complete and share in class.
* Matrix Operations - There are four problems in this section to complete and share in class.

# Loading & Saving Data

### Reading
- https://wesmckinney.com/book/accessing-data.html

### Learning Outcomes

- Loading data from files
- Loading data from a database
- Loading data from the internet
- Saving data to files


# Loading Data into Pandas

Pandas can import structured data from a variety of file formats and data sources.

Files can be in plain text or in certain binary formats that Pandas recognizes.

Files can be on the same computer as your program or on a remote system.



## Pandas Data-loading Methods

Some common methods and file formats are:

- **read_csv** - read delimited values from .csv or .tsv files
- **read_html** - read tables from a .html file
- **read_excel** - read .xls, .xlsx Excel spreadsheets
- **read_json** - read JavaScript Object Notation (JSON) from .json files
- **read_sas** - read a dataset created by SAS
- **read_spss** - read a data file created by SPSS
- **read_xml** - read Extensible Markup Language (XML) from .xml files
- **read_sql** - read results of an SQL query
- **read_sql_table** - read a whole SQL table (similar to selecting everything in a table using `read_sql`)

These methods work best when the data is in tabular form. If the data isn’t tabular (e.g. complex or nested data), the read method may raise an error or return a DataFrame that needs further reshaping.

### Data Loading Options

All the above methods convert input data into a DataFrame. But they vary according to optional arguments for how to interpret the data.

Some common considerations are:

- **Indexing** - which columns to read and whether to get column names from the file
- **Type inference** - converting data to optimal types
- **Date & time parsing** - identifying date/time values and combining multiple columns into one
- **Chunked iteration** - reading large files in chunks
- **Handling dirty data** - Includes skipping rows or comments, formatting numeric data, etc.

See a full list of [options for reading CSV files](https://wesmckinney.com/book/accessing-data#tbl-table_read_csv_function).
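As a rough illustration of the options listed above, the sketch below reads a small CSV from an in-memory string. The CSV text, the column names, and the `missing` sentinel are all made up for this example; only the `read_csv` arguments themselves are real pandas parameters.

```python
import io

import pandas as pd

# Hypothetical CSV text, made up for illustration only
csv_text = (
    "# survey export\n"
    "id,visit_date,score,notes\n"
    "1,2024-01-05,3.5,ok\n"
    "2,2024-01-06,missing,late\n"
    "3,2024-01-07,4.0,ok\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    comment="#",                            # dirty data: skip comment lines
    usecols=["id", "visit_date", "score"],  # indexing: read only some columns
    index_col="id",                         # indexing: use a column as the row index
    parse_dates=["visit_date"],             # date parsing
    na_values=["missing"],                  # dirty data: treat a sentinel as NaN
)
print(df.dtypes)

# Chunked iteration: read the same text two rows at a time
chunks = pd.read_csv(io.StringIO(csv_text), comment="#", chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Because `score` contains a missing value, type inference stores it as a float column rather than as text.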

## Loading Data From Files

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Local Files

Pandas can read files located on the same computer using relative or absolute file paths.

In [2]:
# file in same directory
df = pd.read_csv("data.csv")

# file path relative to current directory. 
# Meaning if you had a folder called `examples` in this repo you could access it with the command below. 
#df = pd.read_csv("examples/data.csv")

# absolute file path
#df = pd.read_csv("/usr/johndoe/examples/data.csv")
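The learning outcomes also mention saving data to files. A minimal sketch of a save-and-reload round trip is shown here, assuming a throwaway temporary folder and a made-up DataFrame (`example_df` and its columns are hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data, made up for illustration
example_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a file path inside the temporary folder
    csv_path = Path(tmp_dir) / "data.csv"

    # index=False keeps the row index out of the file
    example_df.to_csv(csv_path, index=False)

    # read_csv accepts a Path object as well as a string
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

Using `pathlib.Path` to join folder and file names avoids hand-writing path separators, which differ between operating systems.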

### Internet Files

In [None]:
# get example data from public internet location
url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)
df.head()

Sometimes it's helpful to retrieve a file from the internet and save it to disk before reading it into a DataFrame.

Python's `urllib.request` module is helpful for that:

In [None]:
from urllib import request
data_url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
request.urlretrieve(data_url, filename='mortality_data.csv')

### Google Drive

## Loading Data from Databases

### Using SQL

Python has libraries for interacting with common relational database platforms:

- sqlite3 - SQLite
- pymysql - MySQL
- psycopg2 - PostgreSQL
- cx_Oracle - Oracle
- pymssql - MS SQL Server

You can `query` a database from Python by:

- creating a connection object with the `connect()` method
- getting a cursor object with the `cursor()` method
- executing an SQL query to fetch desired rows with `execute()` and `fetchall()`

For example, to list the tables in a database:
```
import sqlite3
fires_con = sqlite3.connect('Data/FPA_FOD_20170508.sqlite')
fires_cur = fires_con.cursor()
fires_cur.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
```

SQL query results can be read directly into a DataFrame using the `read_sql_query` method:

```
fires = pd.read_sql_query(
    '''SELECT STATE, FIRE_YEAR, DATETIME(DISCOVERY_DATE) AS DISCOVERY_DATE,
              FIRE_NAME, FIRE_SIZE, LATITUDE, LONGITUDE
       FROM Fires''',
    fires_con,
)
```
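The wildfire database used above may not be on your machine, so here is a self-contained sketch of the same connect, cursor, and query steps using an in-memory SQLite database. The `fires_demo` table and its rows are invented for this example and are not the real wildfire data.

```python
import sqlite3

import pandas as pd

# In-memory database with a made-up table (hypothetical, not the wildfire data)
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fires_demo (state TEXT, fire_year INTEGER, fire_size REAL)")
cur.executemany(
    "INSERT INTO fires_demo VALUES (?, ?, ?)",
    [("AK", 2015, 120.5), ("CA", 2015, 30.0), ("AK", 2016, 5.25)],
)
con.commit()

# List the tables using the cursor
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)

# Read a parameterized query straight into a DataFrame
ak_fires = pd.read_sql_query(
    "SELECT state, fire_year, fire_size FROM fires_demo WHERE state = ?",
    con,
    params=("AK",),
)
print(ak_fires)

con.close()
```

Passing values through `params` rather than formatting them into the SQL string lets the database driver handle quoting.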

## Working with JSON Data

The JSON format is popular for transmitting data between applications and closely matches the structure of a Python `dict`, with the exception of its `null` value and some other minor syntax differences.

There are several Python libraries for handling JSON, including `json` which is built into Python.



In [None]:
# example JSON document stored as a Python string
json_string = """
{
  "state": "AK",
  "cities": [
  {"name": "Anchorage", "pop": 250000, "region": "south-central"},
  {"name": "Fairbanks", "pop": 75000, "region": "interior"},
  {"name": "Juneau", "pop": 25000, "region": "south-east"}
  ],
  "industries": ["fishing","mining","tourism"]
}
"""
from pprint import pprint as pp
import json
data = json.loads(json_string)

# use pretty-print to print formatted json data
pp(data)

cities = pd.DataFrame(data["cities"], columns=["name", "pop"])
cities
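When nested JSON needs to keep some of its parent fields, `pd.json_normalize` is one option. The sketch below repeats a shortened copy of the example data so it runs on its own; whether this layout is the right one depends on what you want each row to represent.

```python
import pandas as pd

# Shortened copy of the example data above, repeated so this block is self-contained
data = {
    "state": "AK",
    "cities": [
        {"name": "Anchorage", "pop": 250000, "region": "south-central"},
        {"name": "Fairbanks", "pop": 75000, "region": "interior"},
    ],
}

# One row per city, with the parent "state" value repeated on each row
flat = pd.json_normalize(data, record_path="cities", meta=["state"])
print(flat)
```

Here `record_path` names the nested list that becomes the rows, and `meta` names the top-level fields to copy onto every row.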

# Exploratory Data Analysis

### Reading

- https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/
- https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
- https://greenteapress.com/thinkstats2/html/thinkstats2002.html


### Learning Outcomes

- exploratory data analysis
- common data plots - line, bar, scatter, histogram


### Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is a crucial initial step for understanding a dataset and preparing for statistical analysis.

EDA answers questions such as:
- what is the data quality?
- what patterns are evident in the data?
- is the data ready to support analysis & conclusions?

EDA is the process of performing initial investigations to:
- Uncover underlying structure & patterns in the data
- Identify important variables
- Identify anomalies
- Set the stage for statistical analysis & visualization


### Problem 1: Understanding the data
This problem will be a collection of easier tasks. Fill in the 5 empty cells below to complete this problem. 

As a first step, it's important to know your data's composition:

- how many observations?
- how many features (variables)?
- what does your data look like?
- which are the dependent variables?
- are any data missing or incorrect?
- what are the primary statistics for each feature?
- what are the data types?
- are variables numeric or categorical?

In [6]:
# retrieve sample data
download_url = ("https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv")
movie_df = pd.read_csv(download_url)

**Problem 1a**

- how many observations?
- how many features (variables)?


In the cell below print the number of rows and columns in `movie_df`. 

**Problem 1b**

- what does your data look like?
- which are the dependent variables?

In the cell below output the first 10 rows of your data.

**Problem 1c**

- are any data missing or incorrect?
- what are the primary statistics for each feature?

Get a basic statistical description of the data. In the first cell describe the numeric data and in the second describe the objects.

**Problem 1d**

- what are the data types?
- are variables numeric or categorical?

Get the information about your data frame. You'll want to output the name of each column, the non-null count and the data type of each column.

### Types of data

How you analyze the data will depend on its type.

- **Categorical** - Data that can take on one of a limited, and usually fixed, number of possible values. Each row is assigned to one of these values, often referred to as a category or group. Often type `object` in pandas.
    - **Nominal** - Categorical data with no inherent ordering between the categories (e.g. vehicle types)
    - **Ordinal** - Categorical data with an inherent ordering, but where the “differences” between categories have no numerical meaning (e.g. agree, strongly agree, etc)
- **Numeric/Continuous** - Data that is a number which represents an interval or ratio as described below. Often type `int` or `float` in pandas.
    - **Interval** - Numeric data with ordering, where relative differences in interval data have meaning (e.g. temperatures)
    - **Ratio** - Numeric data where the ratio between measurements has some meaning

One thing to be careful of is categorical variables stored as numbers, e.g. group numbers from the set [1, 2, 3, 4]. A good test is to ask whether you can perform math on the data and have the result make sense. For example, for the groups, 2 - 1 = 1, but this doesn't say anything about the relationship between group 1 and group 2.
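To make the point about numeric-looking categories concrete, here is a small sketch with invented group labels (the `group_number` column is hypothetical). pandas will happily average the labels, which is why converting them to the `category` dtype can be a useful safeguard.

```python
import pandas as pd

# Hypothetical group labels stored as integers
groups = pd.DataFrame({"group_number": [1, 2, 3, 4, 2, 1]})

# This computes without complaint, but the "average group" has no meaning
print(groups["group_number"].mean())

# Marking the column as categorical makes the intent explicit
groups["group_number"] = groups["group_number"].astype("category")
print(groups["group_number"].dtype)

# Counting rows per group is the kind of summary that does make sense
counts = groups["group_number"].value_counts()
print(counts)
```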

- **Dates** - Dates are a tricky third type of data that can be treated as either categorical or numeric depending on the application. 
    - **Categorical** - If you break a date into its parts you can treat it as categorical data. For example, which day of the week has the largest number of ticket sales? Day of the week is a category in this application.
    - **Numeric/Continuous** - You can treat dates as numbers if you think about them as time since a fixed point. Let's say you treated a water sample on 2025-01-01 and measured how clean the water was each day afterward. Each row would have a date that tells you how many days have elapsed since 2025-01-01. You could calculate the number of days and use that as a number in a line graph, or you could simply use the date, and it would convey the same information.
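Both views of a date can be derived from the same column. The following sketch uses invented measurements loosely modeled on the water-sample example; the `clarity` values and column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical measurements taken after a 2025-01-01 treatment date
samples = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-08"]),
    "clarity": [0.2, 0.3, 0.7],
})

# Categorical view: the day of the week each sample was taken
samples["day_of_week"] = samples["date"].dt.day_name()

# Numeric view: whole days elapsed since the fixed starting point
start = pd.Timestamp("2025-01-01")
samples["days_elapsed"] = (samples["date"] - start).dt.days

print(samples)
```

The first and last samples share a day of the week even though they are a week apart, which shows how the two views answer different questions.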

In this data set we have `year` which describes the year the movie was released. We will use this as numeric for our exploration. 

### Problem 2: Getting list of features
Create two lists, `numeric_features` and `categorical_features`, containing the column names of each type.

There are a few ways to do this:
1. Use `dtypes`, `columns` and list comprehension
2. Use `select_dtypes` and `columns`

Implement both ways and then check that they produce the same lists.

### Problem 3: Print out value counts of categorical features
It is often easier to loop through the list of your columns and perform some basic exploration of columns of the same type. For this problem you will loop through all `categorical_features` and, if a feature has 10 or fewer unique values, print out the count of each value.

### Problem 4: Histograms for Numeric Columns
For all numeric_features:
1. Print the name of the column
2. Create a histogram plot using `df.plot.hist`
3. Make the histogram show using plt.show()

For 5 points of extra credit, you can set the bin width using the Freedman–Diaconis rule and the number of bins such that
number_bins = (max - min)/bin_width

https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
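For reference, the Freedman–Diaconis rule sets the bin width $h$ from the interquartile range $\mathrm{IQR}(x)$ and the number of observations $n$. Rounding the bin count up to a whole number is a common convention and an assumption here; the problem statement gives only the ratio.

$$h = 2 \, \frac{\mathrm{IQR}(x)}{\sqrt[3]{n}}, \qquad \text{number\_bins} = \left\lceil \frac{\max(x) - \min(x)}{h} \right\rceil$$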

### Problem 5: Numerical Relationships
The other thing we often want to understand is whether there is a relationship between numerical features. Scatter plot matrices are a quick way to evaluate whether there is a relationship between a collection of numerical features.

Looking at the histograms above, we have several features that look more categorical than numeric, e.g. decade code. Because of this we will create a scatter plot matrix with only a subset of the features:

`['budget', 'domgross', 'intgross', 'budget_2013$', 'domgross_2013$', 'intgross_2013$']`

There are a few scatter plot matrices available in the Python data ecosystem. You will use two of them in this problem:

1. pd.plotting.scatter_matrix
2. seaborn.pairplot

## Matrix Operations

In this section we apply what we have learned about linear algebra by working through a few matrix operation problems.

### Problem 6: Matrix Addition

Implement matrix addition element-wise using two for loops over rows and columns. What happens when you try to add two matrices with mismatching sizes? This exercise will help you think about breaking down a matrix into rows, columns, and individual elements.

### Problem 7: Matrix Multiplication

Code matrix multiplication using for loops. Confirm your results against NumPy's `@` operator. This exercise will help you solidify your understanding of matrix multiplication, but in practice, it’s always better to use `@` than to write out nested for loops.
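As a reference point for checking your loops, here is a brief sketch of what `@` does with two small hand-picked matrices (the values are arbitrary). It shows the shape rule and one entry computed as a dot product; it is not a loop-based implementation.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])      # shape (3, 2)

# The inner dimensions (3 and 3) match, so the product has shape (2, 2)
C = A @ B
print(C.shape)
print(C)

# Each entry is the dot product of a row of A with a column of B
entry = np.dot(A[0, :], B[:, 1])
print(entry == C[0, 1])
```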

### Problem 8: Matrix Operation Commutativity and Distributivity

Matrix addition and scalar multiplication obey the mathematical laws of commutativity and distributivity. That means that the following equations give the same results (assume that the matrices $A$ and $B$ are the same size and that $\sigma$ is some scalar):

$$\sigma (A + B) = \sigma A + \sigma B = A \sigma + B \sigma$$

Rather than proving this mathematically, you are going to demonstrate it through coding. In Python, create two random-number matrices of the same size and a random scalar. Then implement the three expressions in the previous equation. You’ll need to figure out a way to confirm that the three results are equal. Keep in mind that tiny computer precision errors should be ignored.
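To see why exact equality can mislead, consider this small sketch with plain floating-point numbers. It is only a demonstration of tolerance-based comparison using NumPy's `isclose` and `allclose` with their default tolerances; which tolerance suits your matrices is a judgment call.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3

print(a == b)            # exact comparison is affected by rounding error
print(abs(a - b))        # size of the discrepancy
print(np.isclose(a, b))  # comparison within a tolerance

# For whole arrays, np.allclose applies the same idea element by element
x = np.array([0.1 + 0.2, 1.0])
y = np.array([0.3, 1.0])
print(np.allclose(x, y))
```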

### Problem 9: Checking Matrix Symmetry

In this exercise, you will write a Python function that checks whether a matrix is symmetric. It should take a matrix as input, and should output a boolean True if the matrix is symmetric or False if the matrix is nonsymmetric. Keep in mind that small computer rounding/precision errors can make “equal” matrices appear unequal. Therefore, you will need to test for equality with some reasonable tolerance. Test the function on symmetric and nonsymmetric matrices.