## Getting Data


### APIs

### SQL

### Scraping


## Cleaning Data

### re-scaling

### missing - understand why and magnitude. KNN vs median vs ..

### outliers

### text and categorical

### Sklearn pipelines


## Describing Data

### Pandas describe, info, value_counts, correlation matrix (and limits - Simpson's paradox, etc. DS f S)

### mean, median and percentiles, std_dev

### group by and pivot tables, reshape, and cross tab


## Visualizing Data

### matplotlib and seaborn

### Histogram, line, data aware, bar, and scatter

### Geo data

## Getting Data

One of the first steps for a machine learning project is getting the data! In industry, this can sometimes be more difficult than it sounds. :)

In this section we will review some of the most common ways of accessing data and some sources of data.

### Data Sources

* [Kaggle](https://www.kaggle.com)
* [UCI](https://archive.ics.uci.edu/ml/datasets.html)
* [Awesome Public Data Sets](https://github.com/caesar0301/awesome-public-datasets)
* A website via web scraping
* A website's API

### Reading in CSV files

CSV files are extremely common in machine learning. Each line of the file holds one row of data, written as a comma-separated list in which each element is the value for one column. Pandas makes it extremely easy to read in these data.

The documentation can be found [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). A few parameters of note:

1. sep - the delimiter between columns. This defaults to a comma, but we can specify anything we want. For example, the comma-delimited format is awkward if some of your columns themselves contain commas. A better option might be a |.
2. header - which row (if any) has the column names.
3. names - the column names to use.

If your CSV is well formatted with the first row being the column names, then the default parameters should work well.
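
As a minimal sketch of the three parameters above, the snippet below reads a small pipe-delimited table from an in-memory string rather than a file. The sample rows and the column names are made up for illustration; only sep, header, and names are taken from the read_csv documentation.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data with no header row.
# Note the comma inside the first field, which would confuse a plain comma split.
raw = "Smith, Jane|34|Engineer\nDoe, John|51|Teacher\n"

df = pd.read_csv(
    io.StringIO(raw),              # a file path or URL would also work here
    sep="|",                       # use | instead of the default comma
    header=None,                   # the data has no row of column names
    names=["name", "age", "job"],  # so we supply them ourselves
)

print(df)
```

Because the delimiter is a |, the comma in the name field stays part of the value instead of starting a new column.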

In [2]:
import pandas as pd

In [3]:
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, names=names)

In [4]:
train_df.head()

| | age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Reading JSON

JSON is also a very popular format as it allows for a more flexible schema. A lot of the data sent around the web is transmitted as JSON. Here is an example:
```text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
```
Python can actually quite easily read these data from strings into dictionaries:

In [5]:
import json

In [6]:
json_string = """{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}"""

In [7]:
json_dict = json.loads(json_string)
json_dict

{'glossary': {'GlossDiv': {'GlossList': {'GlossEntry': {'Abbrev': 'ISO 8879:1986',
     'Acronym': 'SGML',
     'GlossDef': {'GlossSeeAlso': ['GML', 'XML'],
      'para': 'A meta-markup language, used to create markup languages such as DocBook.'},
     'GlossSee': 'markup',
     'GlossTerm': 'Standard Generalized Markup Language',
     'ID': 'SGML',
     'SortAs': 'SGML'}},
   'title': 'S'},
  'title': 'example glossary'}}

In [9]:
json_dict["glossary"]

{'GlossDiv': {'GlossList': {'GlossEntry': {'Abbrev': 'ISO 8879:1986',
    'Acronym': 'SGML',
    'GlossDef': {'GlossSeeAlso': ['GML', 'XML'],
     'para': 'A meta-markup language, used to create markup languages such as DocBook.'},
    'GlossSee': 'markup',
    'GlossTerm': 'Standard Generalized Markup Language',
    'ID': 'SGML',
    'SortAs': 'SGML'}},
  'title': 'S'},
 'title': 'example glossary'}
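
Since json.loads returns ordinary nested dictionaries and lists, individual values can be reached by chaining keys. The sketch below uses a shortened copy of the glossary example (trimmed here for brevity, so some keys from the full example are deliberately absent) and also shows json.dumps for turning the dictionary back into a string.

```python
import json

# Shortened, hypothetical copy of the glossary example above.
json_string = """{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "GlossDef": {"GlossSeeAlso": ["GML", "XML"]}
                }
            }
        }
    }
}"""

json_dict = json.loads(json_string)

# Chain keys to walk down the nesting.
entry = json_dict["glossary"]["GlossDiv"]["GlossList"]["GlossEntry"]
see_also = entry["GlossDef"]["GlossSeeAlso"]  # a regular Python list

# .get returns a default instead of raising KeyError when a key is missing.
acronym = entry.get("Acronym", "n/a")

# json.dumps goes the other way: dictionary back to a JSON string.
round_trip = json.dumps(json_dict, indent=2)

print(entry["ID"], see_also, acronym)
```

Using .get with a default is one way to cope with the flexible schema mentioned earlier, where a key may be present in some records and missing in others.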

Here is a somewhat more realistic example. Let's say that we have some data on a web page: https://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/ytd/12/1880-2016.json

We would like to grab the data directly from the page and get it into a data frame. First, we need to get the data. The simplest way is to use the requests package:

In [12]:
import requests

In [13]:
get_request = requests.get("https://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/ytd/12/1880-2016.json")

In [14]:
get_request.text

'{"description":{"title":"Global Land and Ocean Temperature Anomalies, January-December","units":"Degrees Celsius","base_period":"1901-2000","missing":"-999.0000"},"data":{"1880":"-0.13","1881":"-0.07","1882":"-0.07","1883":"-0.15","1884":"-0.21","1885":"-0.22","1886":"-0.21","1887":"-0.25","1888":"-0.15","1889":"-0.10","1890":"-0.32","1891":"-0.25","1892":"-0.30","1893":"-0.32","1894":"-0.28","1895":"-0.23","1896":"-0.09","1897":"-0.12","1898":"-0.26","1899":"-0.12","1900":"-0.07","1901":"-0.14","1902":"-0.25","1903":"-0.34","1904":"-0.42","1905":"-0.29","1906":"-0.22","1907":"-0.38","1908":"-0.44","1909":"-0.43","1910":"-0.39","1911":"-0.44","1912":"-0.33","1913":"-0.32","1914":"-0.14","1915":"-0.07","1916":"-0.29","1917":"-0.31","1918":"-0.20","1919":"-0.20","1920":"-0.21","1921":"-0.14","1922":"-0.23","1923":"-0.21","1924":"-0.25","1925":"-0.14","1926":"-0.06","1927":"-0.15","1928":"-0.17","1929":"-0.29","1930":"-0.10","1931":"-0.07","1932":"-0.12","1933":"-0.25","1934":"-0.10","19

Without going into too much detail, a GET request asks the server for the contents of the page and returns them. You can learn some more [here](https://www.w3schools.com/tags/ref_httpmethods.asp). You can then get the raw text of the response with the .text attribute. From there we can load the text using json.loads:

In [15]:
climate_dict = json.loads(get_request.text)

In [16]:
climate_dict.keys()

dict_keys(['description', 'data'])

In [17]:
climate_dict['description']

{'base_period': '1901-2000',
 'missing': '-999.0000',
 'title': 'Global Land and Ocean Temperature Anomalies, January-December',
 'units': 'Degrees Celsius'}
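
As an alternative to calling json.loads on the .text attribute, a requests response also has a .json() method that does the decoding for you, and a raise_for_status() method that raises an exception when the server replies with an error code. The helper below is a minimal sketch of that pattern; the function name fetch_json and the 10 second timeout are assumptions chosen for illustration, not part of the original notebook.

```python
import requests


def fetch_json(url, timeout=10):
    """Download a URL with a GET request and decode the body as JSON.

    fetch_json is a hypothetical helper; the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx replies instead of parsing an error page.
    response.raise_for_status()
    # Equivalent to json.loads(response.text) for a JSON body.
    return response.json()
```

Calling fetch_json with the NOAA URL above should return the same kind of dictionary as climate_dict, assuming the page is still available.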

In [21]:
series = pd.Series(climate_dict['data'])

In [22]:
series.head()

1880    -0.13
1881    -0.07
1882    -0.07
1883    -0.15
1884    -0.21
dtype: object

Since we just have one column, we use a pandas Series rather than a data frame. Note that the dtype is object: the values are still strings, exactly as they appeared in the JSON. Data frames can also be loaded [from JSON](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) and [from dictionaries](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html).
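
Before doing any arithmetic on these values, they would need to be converted to numbers. The sketch below shows one possible way, using a tiny hand-written dictionary shaped like the climate data. The three data values are made up for illustration (one of them is the missing-value marker from the description), and the column names year and anomaly are my own choice.

```python
import numpy as np
import pandas as pd

# Hypothetical payload with the same shape as climate_dict.
payload = {
    "description": {"missing": "-999.0000"},
    "data": {"1880": "-0.13", "1881": "-0.07", "1882": "-999.0000"},
}

series = pd.Series(payload["data"], name="anomaly")

# Convert the strings to floats.
values = pd.to_numeric(series)

# Replace the documented missing-value marker with NaN.
missing_marker = float(payload["description"]["missing"])
values = values.replace(missing_marker, np.nan)

# Turn the year labels into integers and move them into a regular column.
values.index = values.index.astype(int)
frame = values.rename_axis("year").reset_index()

print(frame)
print(frame.dtypes)
```

Replacing the marker with NaN keeps a placeholder such as -999 from distorting statistics like the mean.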

### Reading raw files

Sometimes you also just want to parse data line by line yourself. For example, here is a very simple data set of female first names:

http://deron.meranda.us/data/census-dist-female-first.txt

Perhaps we just want to extract all of the names. The code below reads the male counterpart of this file, which I have saved in the Git repo. In raw Python we can do:

In [24]:
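names = []
# Keep the first whitespace-separated token of each line (the name).
with open("../small_data/male_names.txt", "r") as f:
    for line in f:
        tokens = line.split()  # split on any run of whitespace
        if tokens:  # skip blank lines
            names.append(tokens[0])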

In [25]:
names[:5]

['JAMES', 'JOHN', 'ROBERT', 'MICHAEL', 'WILLIAM']

In [26]:
from collections import Counter

In [27]:
Counter(names).most_common(10)

[('JAMES', 1),
 ('JOHN', 1),
 ('ROBERT', 1),
 ('MICHAEL', 1),
 ('WILLIAM', 1),
 ('DAVID', 1),
 ('RICHARD', 1),
 ('CHARLES', 1),
 ('JOSEPH', 1),
 ('THOMAS', 1)]
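
The same line-by-line approach can pull out more than the first token. The sketch below assumes each line has four whitespace-separated columns (name, frequency, cumulative frequency, and rank); that layout is an assumption about the file, and the sample lines are invented placeholders rather than real rows. It reads from an in-memory string so it runs without the file.

```python
import io

# Invented placeholder lines; the four-column layout is an assumption.
sample = io.StringIO(
    "ALPHA          3.0   3.0      1\n"
    "BETA           2.5   5.5      2\n"
    "\n"
    "GAMMA          1.5   7.0      3\n"
)

records = []
for line in sample:
    tokens = line.split()  # split on any run of whitespace
    if len(tokens) != 4:
        continue  # skip blank or malformed lines
    name, freq, cum_freq, rank = tokens
    records.append(
        {
            "name": name,
            "freq": float(freq),
            "cum_freq": float(cum_freq),
            "rank": int(rank),
        }
    )

# With numeric fields parsed, we can ask questions such as "which name is most frequent?"
top = max(records, key=lambda r: r["freq"])
print(top["name"])
```

A list of dictionaries like records can also be passed straight to pd.DataFrame if you later want the data in a data frame.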

### Getting Data from APIs (GET / POST)