## Lab 3.4: csvs, functions, numpy, and distributions

Run the cell below to load the required packages and set up plotting in the notebook!

In [1]:
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
%matplotlib inline

### Sales data

For this lab we will use a truncated version of a sales dataset that we will examine in more detail later.

The csv has about 200 rows of data and 4 columns. The relative path to the csv ```sales_info.csv``` is provided below. If you copied or moved the files, the path may be different for you, and you will need to work out the correct relative path.

In [2]:
sales_csv_path = '../../assets/datasets/sales_info.csv'

#### 1. Loading the data

Set up an empty list called ```rows```.

Using the pattern for loading csvs we learned earlier, add all of the rows in the csv file to the rows list.

For your reference, the pattern is:
```python
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    ...
```

Beyond this, adding the rows in the csv file to the ```rows``` variable is up to you.

In [3]:
rows = []
with open(sales_csv_path, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)

#### 2. Separate header and data

The header of the csv is the first element (index 0) of the ```rows``` variable, since it is the first row in the csv file.

Use Python indexing and slicing to create two new variables: ```header```, which contains the 4 column names, and ```data```, which contains a list of lists, each sub-list representing a row from the csv.

Lastly, print ```header``` to see the names of the columns.

In [4]:
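header = rows[0]
# Slice off the first row; everything after it is data.
# (Filtering with `x != rows[0]` would also drop any data row identical to the header.)
data = rows[1:]

print(header)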

['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales']


#### 3. Create a dictionary with the data

Use loops or list comprehensions to create a dictionary called ```sales_data```, where the keys of the dictionary are the column names, and the values of the dictionary are lists of the data points of the column corresponding to that column name.

In [5]:
#sales_data = {h:[val for val in data] for h in header}
sales_data = {}

for c_index, c_name in enumerate(header):
    sales_data[c_name] = []
    for val in data:
        sales_data[c_name].append(val[c_index])
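
The nested loop above can also be written as a single dictionary comprehension. The following is a minimal sketch on made-up values; `toy_header`, `toy_data`, and `toy_sales` are hypothetical names used only for illustration. The key point is that each column needs its own index `i`, so that only the `i`-th item of every row lands in that column's list.

```python
# Hypothetical miniature header and rows standing in for the real csv contents
toy_header = ['volume_sold', '2015_margin']
toy_data = [['18.4', '93.8'],
            ['4.7', '21.0']]

# For each (index, column name) pair, collect the item at that index from every row
toy_sales = {name: [row[i] for row in toy_data]
             for i, name in enumerate(toy_header)}

print(toy_sales['volume_sold'])
```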

**3.A** Print out the first 10 items of the 'volume_sold' column.

In [6]:
print(sales_data['volume_sold'][:10])

['18.4207604861', '4.77650991918', '16.6024006077', '4.29611149826', '8.15602328201', '5.00512242518', '14.60675', '4.45646649485', '5.04752965097', '5.38807023767']


In [7]:
# sd_values = sales_data.values()
# sd_values

In [8]:
# sditems = sales_data.items()

In [9]:
# sd2 = []
# for l in sd_values:
#     sd2.append(map(float(x), l))
# sd2

#### 4. Convert data from string to float

As you can see, the data is still in string format (which is how it is read in from the csv). For each key:value pair in our ```sales_data``` dictionary, convert the values (column data) from string values to float values.

In [10]:
# sales_data = {k:int(k) for k in sales_data}
# print sales_data
# sales_data_int = {k:[float(value) for value in sales_data.items() if type(value)==str] for k in sales_data}

# def float_converter(dict):
#     for k, v in enumerate(sales_data):
#         if type(value) == str:
#             value = float(value)


# print float_converter(sales_data)

for name, col in sales_data.items():
    col = [float(x) for x in col]
    sales_data[name] = col
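
An alternative to updating the dictionary in place is to build a new dictionary in one expression. This is a small sketch on made-up values, with `toy_strings` and `toy_floats` as hypothetical names; it assumes every value in every column is a string that `float()` can parse.

```python
# Hypothetical string data in the same shape as sales_data
toy_strings = {'volume_sold': ['18.4', '4.7'],
               '2015_margin': ['93.8', '21.0']}

# Rebuild the dictionary, converting each column's strings to floats
toy_floats = {name: [float(x) for x in col]
              for name, col in toy_strings.items()}

print(toy_floats['2015_margin'])
```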

In [12]:
sales_data

{'2015_margin': [93.8022814583,
  21.0824246877,
  93.6124943024,
  16.8247038328,
  35.0114570034,
  31.8774372328,
  76.5189730216,
  19.3373453608,
  26.142470349,
  22.4270237673,
  41.892132964,
  66.4030492424,
  47.8693242069,
  86.7601504011,
  22.9481762576,
  18.7060545353,
  44.0411766297,
  62.1990040107,
  14.2518095238,
  16.0432686391,
  25.1911714286,
  31.7530658307,
  23.1614514016,
  48.8207407407,
  73.2315044248,
  23.4503335716,
  14.1447926267,
  36.4085284899,
  36.1718619066,
  59.8934779211,
  37.1085548647,
  52.4055916932,
  30.681099171,
  48.1333683392,
  47.7406803594,
  97.2243398739,
  31.2923926822,
  35.27017991,
  31.9091555963,
  29.1482532051,
  32.6235916667,
  47.9893704508,
  55.5221865049,
  31.941637952,
  49.3420628537,
  42.8693852124,
  53.1849073341,
  25.4050062762,
  43.9390962494,
  44.5348318408,
  39.530065189,
  31.5106033203,
  50.1331972789,
  28.7115801384,
  52.4235630748,
  24.0028010033,
  47.3184344342,
  49.1944300868,
  19.9

#### 5. Write a function to print summary statistics

Now write a function to print out summary statistics for the data.

Your function should:

- Accept two arguments: the column name and the data associated with that column
- Print out information, clearly labeling each item when you print it:
    1. Print out the column name
    2. Print the mean of the data using ```np.mean()```
    3. Print out the median of the data using ```np.median()```
    4. Print out the mode of the **rounded data** using ```stats.mode()```
    5. Print out the variance of the data using ```np.var()```
    6. Print out the standard deviation of the data using ```np.std()```
    
Remember that you will need to convert the numeric results of these functions to strings by wrapping them in the ```str()``` function.

**5.A** Using your function, print the summary statistics for 'volume_sold'

**5.B** Using your function, print the summary statistics for '2015_margin'

**5.C** Using your function, print the summary statistics for '2015_q1_sales'

**5.D** Using your function, print the summary statistics for '2016_q1_sales'
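
One possible shape for such a function is sketched below. It is not the only valid answer: the names `summary_statistics` and `rounded_mode`, the label wording, and the example values are all assumptions made for illustration. The helper wraps `stats.mode()` because, depending on the installed SciPy version, its `.mode` attribute may be a scalar or a length-1 array.

```python
import numpy as np
import scipy.stats as stats


def rounded_mode(data):
    """Return the most common value after rounding to whole numbers."""
    result = stats.mode(np.round(data))
    # .mode may be a scalar or a length-1 array depending on the SciPy
    # version; np.atleast_1d handles both cases.
    return np.atleast_1d(result.mode)[0]


def summary_statistics(column, data):
    """Print labeled summary statistics for one numeric column."""
    print('COLUMN: ' + column)
    print('mean: ' + str(np.mean(data)))
    print('median: ' + str(np.median(data)))
    print('mode (rounded data): ' + str(rounded_mode(data)))
    print('variance: ' + str(np.var(data)))
    print('standard deviation: ' + str(np.std(data)))


# Made-up values for illustration, not taken from sales_info.csv
summary_statistics('example_column', [1.2, 1.4, 2.6, 3.1, 10.0])
```

Note that `np.var()` and `np.std()` compute the population versions by default (`ddof=0`); pass `ddof=1` if you want the sample versions instead. In the lab, a call would look like `summary_statistics('volume_sold', sales_data['volume_sold'])`.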

#### 6. Plot the distributions

We've provided a plotting function below called ```distribution_plotter()```. It takes two arguments, the name of the column and the data associated with that column.

In individual cells, plot the distributions for each of the 4 columns. Do the data appear skewed? Symmetrical? If skewed, what would be your hypothesis for why?

In [11]:
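def distribution_plotter(column, data):
    """Plot a histogram with a KDE overlay for one column of data."""
    sns.set(rc={"figure.figsize": (10, 7)})
    sns.set_style("white")
    # Note: distplot is deprecated as of seaborn 0.11 and emits a warning there.
    dist = sns.distplot(data, hist_kws={'alpha':0.2}, kde_kws={'linewidth':5})
    dist.set_title('Distribution of ' + column + '\n', fontsize=16)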
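
As a rough numeric cross-check on what the plots suggest about skew, the sample skewness can be computed with `scipy.stats.skew()`. The lab does not ask for this; it is a minimal sketch on made-up numbers, and the variable names are hypothetical. A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetrical distribution.

```python
import scipy.stats as stats

# Made-up values: a few small numbers plus one large one gives a long right tail
right_tailed = [1, 1, 1, 2, 10]
# Made-up values: evenly spaced around the center
symmetric = [1, 2, 3, 4, 5]

right_skew = stats.skew(right_tailed)
symmetric_skew = stats.skew(symmetric)

print('right-tailed skew: ' + str(right_skew))
print('symmetric skew: ' + str(symmetric_skew))
```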