#  1. Acquire the Data

> "Data is the new oil"

**Ways to acquire data** (typical data sources)

- Download from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website
- Extracted from a PDF file
- Gathered manually and recorded

**Data Formats**
- Flat files (e.g. csv)
- Excel files
- Database (e.g. MySQL)
- JSON
- HDFS (Hadoop)

**Two Datasets**
- Price of Weed in US
- Demographic data by US State 


## 1.1 - Crowdsource the Price of Weed dataset

![Price of weed website](http://www.priceofweed.com/app/misc/images/logo.png)

The Price of Weed website - http://www.priceofweed.com/

The site crowdsources the street price that people pay for weed. All prices are self-reported.
- **Location** is auto-detected or can be chosen
- **Quality** is classified in three categories 
    - High 
    - Medium
    - Low
- **Price by weight**
    - an ounce
    - a half ounce
    - a quarter
    - an eighth
    - 10 grams
    - 5 grams
    - 1 gram
- **Strain** (though not shown in the dataset)

Prices are reported at the individual transaction level.

Here is a sample data set from the United States - http://www.priceofweed.com/prices/United-States.html

See note - *Averages are corrected for outliers based on standard deviation from the mean.*
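
The site does not publish the exact rule behind this correction. As a rough sketch of what "corrected for outliers based on standard deviation from the mean" could look like, the cell below drops any report more than `n_std` standard deviations from the raw mean before averaging. The function name `trimmed_mean`, the threshold of 2, and the prices are all assumptions made for illustration, not the site's actual method or data.

```python
import numpy as np

def trimmed_mean(prices, n_std=2.0):
    """Mean of the values lying within n_std standard deviations of the raw mean.

    Hypothetical helper: the threshold of 2 is an assumption, not the site's rule.
    """
    prices = np.asarray(prices, dtype=float)
    mean, std = prices.mean(), prices.std()
    if std == 0:
        # All reports are identical, so there is nothing to trim
        return mean
    keep = np.abs(prices - mean) <= n_std * std
    return prices[keep].mean()

# Made-up self-reported prices for an ounce, with one implausible entry
reports = [240, 250, 255, 260, 245, 1200]

raw_mean = float(np.mean(reports))
corrected_mean = float(trimmed_mean(reports))
print(raw_mean, corrected_mean)
```

A single extreme report pulls the raw mean far above the typical price, while the trimmed version stays close to what most people reported.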


## 1.2  Scrape the data

[Frank Bi](https://github.com/frankbi) from The Verge wrote a script to scrape the data daily. The daily prices are available on GitHub at https://github.com/frankbi/price-of-weed

Here is sample data from one day - 23rd July 2015 - https://github.com/frankbi/price-of-weed/blob/master/data/weedprices23072015.csv


## 1.3  Combine the data

YHAT combined the CSV files for each day into one large CSV.

http://blog.yhathq.com/posts/7-funny-datasets.html
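
How YHAT actually combined the files is not documented here. A minimal sketch of one way to do it with pandas is shown below, using two tiny in-memory files in place of the real daily CSVs. The prices and dates are made up, and adding a `date` column per file is an assumption about how the day would be recorded.

```python
import io
import pandas as pd

# Two tiny made-up daily files standing in for the scraped CSVs
day1 = io.StringIO("State,HighQ\nAlabama,339.06\nAlaska,288.75\n")
day2 = io.StringIO("State,HighQ\nAlabama,339.20\nAlaska,289.00\n")

frames = []
for date, handle in [("2014-01-01", day1), ("2014-01-02", day2)]:
    daily = pd.read_csv(handle)
    # Tag every row with the day its file represents
    daily["date"] = pd.to_datetime(date)
    frames.append(daily)

# Stack the daily frames on top of each other and renumber the rows
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, the list of handles would be replaced by file paths, for example collected with `glob.glob`.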


## 1.4 Key Questions / Assumptions

> Data is an abstraction of the reality.

- What assumptions have been made in this entire data collection process?
- Are we aware of the assumptions in this process?
- How can we ensure that the data is accurate and representative for the question we are trying to answer?


## 1.5 Loading the Data


In [None]:
# Load the libraries
import pandas as pd
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv("../input/usa-weed-price-data/Weed_Price.csv")

In [None]:
# Shape of the dataset - rows & columns
df.shape

In [None]:
# Check for type of each variable
df.dtypes

In [None]:
# Let's load this again with date as date type
df = pd.read_csv("../input/usa-weed-price-data/Weed_Price.csv", parse_dates=[-1])

In [None]:
# Now check the type of each column
df.dtypes
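
To see what `parse_dates` changes without needing the real file, here is a small sketch on an in-memory CSV. The rows are made up, and the column name `date` is an assumption about the dataset's last column.

```python
import io
import pandas as pd

# Made-up rows; the column name "date" is assumed for illustration
csv_text = (
    "State,HighQ,date\n"
    "Alabama,339.06,2014-01-01\n"
    "Alaska,288.75,2014-01-01\n"
)

# Without parse_dates the date column is read as plain text
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the same column becomes a datetime column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain.dtypes)
print(parsed.dtypes)
```

Naming the column is an alternative to passing its position, and it keeps working if the column order changes.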

In [None]:
# Get the names of all columns
df.columns

In [None]:
# Get the index of all rows
df.index

## 1.6 Viewing the Data

In [None]:
# Can we see some sample rows - the top 5 rows
df.head()

In [None]:
# Can we see some sample rows - the bottom 5 rows
df.tail()

In [None]:
# Get specific rows
df[20:25]

In [None]:
# Can we access a specific column
df["State"]

In [None]:
# Using the dot notation
df.State

In [None]:
# Selecting specific column and rows
df[0:5]["State"]

In [None]:
# Works both ways
df["State"][0:5]

In [None]:
#Getting unique values of State
pd.unique(df['State'])

## 1.7 Slicing columns using pandas

In [None]:
df.index

In [None]:
df.loc[0]

In [None]:
df.iloc[0,0]
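
With the default index, `loc` and `iloc` look interchangeable because labels and positions are the same numbers. The sketch below uses a toy frame with a non-default index to show the difference. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame whose index labels (10, 20, 30) differ from positions (0, 1, 2)
toy = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona"], "HighQ": [339.06, 288.75, 303.31]},
    index=[10, 20, 30],
)

by_label = toy.loc[10]      # row whose index label is 10
by_position = toy.iloc[0]   # first row, whatever its label
cell = toy.iloc[0, 0]       # first row, first column

# Label slices include the end label; position slices exclude the end position
label_slice = toy.loc[10:20, "State"]
pos_slice = toy.iloc[0:1, 0]

print(by_label)
print(cell)
print(label_slice)
print(pos_slice)
```

Here `toy.loc[0]` would raise a `KeyError`, because no row carries the label 0.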

# Exercise

1) Load the Demographics_State.csv dataset

2) Show the first five rows of the dataset

3) Select the column with the State name in the data frame

4) Get help

5) Change index to date 

6) Get all the data for 2nd January 2014

# Thinking in Vectors

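Difference between loops and vectors: a loop handles one row at a time in the Python interpreter, while a vectorized operation applies the same arithmetic to whole arrays at once inside NumPy's compiled code.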

In [None]:
#Find weighted average price with respective weights of 0.6, 0.4 for HighQ and MedQ

In [None]:
#Python approach. Loop over all rows. 
#For each row, multiply the respective columns by those weights. 
#Add the output to an array

In [None]:
#It is easy to convert pandas series to numpy array.
highq_np = np.array(df.HighQ)
medq_np = np.array(df.MedQ)

In [None]:
#Standard pythonic code

def find_weighted_price():
    # Weighted average per row: 0.6 * HighQ + 0.4 * MedQ
    weighted_price = []

    for i in range(df.shape[0]):
        weighted_price.append(0.6*highq_np[i] + 0.4*medq_np[i])

    return weighted_price

#print the weighted price
weighted_price = find_weighted_price()
print(weighted_price)

**Exercise**: Find the running time of the above program

In [None]:
#Vectorized Code
weighted_price_vec = 0.6*highq_np + 0.4*medq_np
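
Before comparing speed, it is worth checking that the two approaches give the same numbers. The sketch below does this on small made-up arrays standing in for `highq_np` and `medq_np`. The names `highq_demo` and `medq_demo` and their values are hypothetical.

```python
import numpy as np

# Made-up prices standing in for the HighQ and MedQ columns
highq_demo = np.array([339.06, 288.75, 303.31])
medq_demo = np.array([198.64, 260.60, 209.35])

# Loop version: one weighted average per row
loop_result = [0.6 * h + 0.4 * m for h, m in zip(highq_demo, medq_demo)]

# Vectorized version: the same arithmetic applied to whole arrays
vec_result = 0.6 * highq_demo + 0.4 * medq_demo

# allclose compares with a small tolerance for floating point rounding
print(np.allclose(loop_result, vec_result))
```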

**Exercise**: Time the above vectorized code. Do you see any improvements?