# Data Analysis with Python
Part of the SWEET Workshop series presented by the [IDEA Student Center at UC San Diego](http://www.jacobsschool.ucsd.edu/student/).

### Goals
Learn the basics of data analysis using Python.

### Requirements
- numpy
- matplotlib

In [None]:
# load required packages

# make the code compatible with python 2.x and 3.x
# (__future__ imports must come before any other statement)
from __future__ import print_function, division

# vectorized functions
import numpy as np

# plotting
import matplotlib.pyplot as plt
%matplotlib inline

## 1) Loading data
Let's start by loading an example data file: a NASA data set from testing the aerodynamic and acoustic performance of different airfoil blade designs.

#### Source
https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise

#### Data description
This problem has the following inputs: 
1. Frequency, in hertz. 
2. Angle of attack, in degrees. 
3. Chord length, in meters. 
4. Free-stream velocity, in meters per second. 
5. Suction side displacement thickness, in meters. 

The only output is: 
6. Scaled sound pressure level, in decibels.

**Discussion**: Based on the data description:
- How many columns should we get from loading the file?
- Which variables have numeric values (if any)?

In [None]:
# we'll load the data using numpy's genfromtxt() function
#
# NOTE: the data columns are separated by commas
#

# load the data
data = np.genfromtxt("airfoil_self_noise.csv", delimiter=",")

# check the data dimensions
#print( data.shape )

# check the data type
#print( type(data) )

# check the data type of one of the individual elements
#print( type(data[0, 0]) )

**Discussion**:
- How many rows and columns are there?
- What type of data was loaded? Numbers? Text?


Which column corresponds to which variable (e.g. frequency)?
- column 0: frequency [Hz]
- column 1: ???
- column 2: ???
- column 3: ???
- column 4: ???
- column 5: scaled sound pressure [dB]
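
To see what `np.genfromtxt()` returns without needing the data file, here is a minimal sketch that parses a small in-memory CSV. The three rows are made-up placeholder values (not taken from the airfoil data set), and `io.StringIO` stands in for a file on disk; the names `csv_text`, `sample`, and `second_column` are only for illustration.

```python
import io

import numpy as np

# hypothetical CSV text: 3 rows, 2 comma-separated columns (made-up values)
csv_text = "1.0,10.0\n2.0,20.0\n3.0,30.0\n"

# genfromtxt accepts file-like objects as well as file names
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# the result is a 2-D array: (number of rows, number of columns)
print(sample.shape)

# every element is stored as a floating-point number
print(sample.dtype)

# select a single column with slicing: all rows, column index 1
second_column = sample[:, 1]
print(second_column)
```

The same slicing pattern, `data[:, k]`, picks out column `k` of the airfoil data once it is loaded.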

## 2) Visualizing data
Now that we've loaded some data, it makes sense to try to visualize it.

In [None]:
# select two of the variables from the data set
x = data[:, 0]
y = data[:, ???]

# create a scatter plot
plt.scatter(x, y)

plt.show()

Let's try to improve the plot formatting. After all, almost every data analysis project involves creating a visual that can then be presented to someone (coworkers, project supervisors, clients, etc.).

Ideas for formatting revisions:
- colors
- figure size
- text labels
- font sizes

In [None]:
# select two of the variables from the data set
x = data[:, 0]
y = data[:, ???]

# set the figure size
plt.figure(figsize=(???, ???))

# create a scatter plot
plt.scatter(x, y, color='???')

# add labels
plt.xlabel('???')
plt.ylabel('???')

# add a grid
plt.grid()

plt.show()

## 3) Statistics
In addition to visuals, statistics can provide valuable information about a data set. Let's try calculating a few common statistics using numpy.

In [None]:
# select one of the variables from the data set
x = data[:, 5]

# what's the mean value?
print( np.mean(x) )

In [None]:
# what are the min and max values?
print( np.max(???) )
print( np.???(???) )

In [None]:
# what about the standard deviation?
print( np.???(x) )
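
The cells above work on one column at a time. As a possible extension, the sketch below computes statistics for every column at once using the `axis` argument, and adds the median and percentiles as two further summary statistics. The array `sample` holds made-up values and only stands in for a loaded data set.

```python
import numpy as np

# hypothetical 4x2 array standing in for a loaded data set (made-up values)
sample = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

# axis=0 collapses the rows, giving one mean per column
column_means = np.mean(sample, axis=0)
print(column_means)

# the median is less sensitive to outliers than the mean
column_medians = np.median(sample, axis=0)
print(column_medians)

# 25th and 75th percentiles of the first column
q25, q75 = np.percentile(sample[:, 0], [25, 75])
print(q25, q75)
```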

## 4) Fitting data
Another common task when working with data is fitting a model to the data. For example, fitting a linear mapping ($y = ax + b$) between two variables ($x$ and $y$). There are many methods for accomplishing this task, but we'll focus on a simple one using numpy and a new data set.

#### Data set
Measurements of global horizontal irradiance (GHI) [W/m^2] and power output [kW] from a PV power plant in San Diego County.
- column 0: GHI [W/m^2]
- column 1: power [kW]

In [None]:
# load the new data set
power_data = np.genfromtxt('sample_ghi_power.csv', delimiter=',')

# check data size (i.e. number of rows and columns)


In [None]:
# make a quick scatter plot
ghi = power_data[:, 0]
power = power_data[:, ???]

# scatter plot
plt.scatter(???, ???)

# add labels
plt.xlabel('???')
plt.ylabel('???')

plt.show()

In [None]:
# now we'll create a polynomial model fitted to the data

# set the input (x) and output (y) variables
x = ghi
y = power

# fit the model to the data:
# - 1 = first order fit => linear fit
# - 2 = second order fit => quadratic
# - etc.
#
coeff = np.polyfit(x, y, 1)

# create a function we can call to evaluate the fitted model
model = np.poly1d(coeff)
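
To make the return values of `np.polyfit()` and `np.poly1d()` concrete, here is a small sketch on made-up points that lie exactly on the line $y = 2x + 1$, so the fit should recover $a = 2$ and $b = 1$ up to floating-point rounding. The names `x_demo`, `y_demo`, and `line` are illustrative and not part of the workshop data.

```python
import numpy as np

# made-up points that lie exactly on the line y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# first-order fit; coefficients come back highest power first: [a, b]
a, b = np.polyfit(x_demo, y_demo, 1)
print(a, b)

# poly1d wraps the coefficients in a callable polynomial
line = np.poly1d([a, b])

# evaluate the fitted line at a new input value
print(line(10.0))
```

With real, noisy measurements the points will not lie exactly on the fitted curve; `np.polyfit()` returns the coefficients that minimize the squared error.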

In [None]:
# now let's plot the fitted model against the data

# create a range of input values at which to evaluate the model
x_fit = np.linspace(0, 1000, 1000)

# use the fitted model to estimate the output
y_fit = model(x_fit)


# plot the original data
plt.scatter(x, y, color='0.5')

# overlay the plot of the fitted model
plt.plot(x_fit, y_fit, color='red')

plt.show()

**Discussion**: Try other values for the order of the model (e.g. 2). Based on the results:
- What appears to be the best type of fitting model? Linear? Quadratic? Cubic?
- Are there any downsides to increasing the order of the model? E.g. why not fit a 100th-order model?
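
One way to put numbers on this comparison is the root-mean-square error (RMSE) between the data and the fitted model. The sketch below is only an illustration under stated assumptions: the data are synthetic (a made-up linear trend plus random noise, not the PV measurements), and `fit_rmse` is a hypothetical helper rather than a numpy function.

```python
import numpy as np

# synthetic data: a made-up linear trend plus random noise (not the PV data)
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 1000.0, 50)
y_demo = 0.8 * x_demo + rng.normal(0.0, 20.0, size=x_demo.size)


def fit_rmse(x, y, order):
    """Fit a polynomial of the given order and return its RMSE on (x, y)."""
    model = np.poly1d(np.polyfit(x, y, order))
    residuals = y - model(x)
    return np.sqrt(np.mean(residuals ** 2))


# compare the error on the fitted data for a few model orders
errors = {order: fit_rmse(x_demo, y_demo, order) for order in (1, 2, 3)}
for order, error in errors.items():
    print(order, error)
```

Keep in mind that this error is measured on the same points used for fitting, so on its own it says nothing about how the model behaves on new data.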