# Pandas for Data Analysis: Getting the Data

**Outline:**

* [Dealing with Files](#Dealing-with-Files)
  * [Reading Data from File](#Reading-Data-from-File)
  * [Writing Data to File](#Writing-Data-to-File)
* Fetching Data from APIs
* [Web Scraping to Build Your Own Dataset](http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/)

## Dealing with Files

### Reading Data from File

#### CSV File

UCI Machine Learning Repository: [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult)

In [45]:
import pandas as pd
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')

In [46]:
adult.head()

| | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |


In [47]:
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

In [48]:
adult.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
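
The two outputs above differ because `adult.data` has no header row: by default `read_csv` treats the first line as column names, so the first record disappears into the header. The sketch below reproduces the effect offline. The three-row `raw` string is a made-up sample in the same layout, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless layout as adult.data.
raw = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n38, Private, 215646\n"

# Default behaviour: the first row is consumed as the header,
# so only two data rows remain.
with_inferred_header = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and columns are numbered from 0.
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_inferred_header.shape)    # (2, 3)
print(without_header.shape)          # (3, 3)
print(list(without_header.columns))  # [0, 1, 2]
```

Comparing `shape` before and after is a quick way to check that no record was silently used as a header.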


In [49]:
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult.columns = columns

In [50]:
adult.head(2)

| | age | Work Class | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Money Per Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |


In [52]:
adult['age'][0:3]

0    39
1    50
2    38
Name: age, dtype: int64

In [None]:
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)

In [None]:
adult.head()

In [None]:
adult['age']

In [56]:
adult.age.value_counts(ascending=True)[0:5]

86    1
87    1
88    3
85    3
83    6
Name: age, dtype: int64

In [58]:
adult[adult.age == adult.age.value_counts().index[0]]['sex'].value_counts()

 Male      611
 Female    287
Name: sex, dtype: int64
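
The chained expression above does two things at once, and its output shows a leading space in ` Male` and ` Female` because the file separates fields with a comma followed by a space. A minimal sketch of the same steps, split apart and with `skipinitialspace=True` to drop that space, might look like this. The five-row sample and the names `people`, `most_common_age`, and `sex_counts` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature sample; the values are made up for illustration.
raw = "39, Male\n50, Female\n39, Male\n39, Female\n28, Male\n"

# skipinitialspace=True strips the space that follows each comma.
people = pd.read_csv(io.StringIO(raw), names=["age", "sex"], skipinitialspace=True)

# Step 1: value_counts() sorts by frequency (descending by default),
# so the first index label is the most common age.
most_common_age = people["age"].value_counts().index[0]

# Step 2: keep only the rows with that age, then count each sex.
sex_counts = people.loc[people["age"] == most_common_age, "sex"].value_counts()

print(most_common_age)       # 39
print(sex_counts.to_dict())  # {'Male': 2, 'Female': 1}
```

Without `skipinitialspace=True`, a filter such as `people["sex"] == "Male"` would match nothing, because the stored value would be `" Male"`.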

#### JSON File

In [59]:
!cat try_series.json

{
    "name": "Kan Ouivirach",
    "email": "kan@prontomarketing.com"
}


In [60]:
series_data = pd.read_json('try_series.json', typ='series')

In [61]:
series_data

email    kan@prontomarketing.com
name               Kan Ouivirach
dtype: object

In [62]:
!cat try_df.json

[
    {
        "name": "Kan Ouivirach",
        "email": "kan@prontomarketing.com"
    },
    {
        "name": "Some Data Scientist",
        "email": "someone@datascience.th"
    }
]


In [63]:
df = pd.read_json('try_df.json')

In [64]:
df

| | email | name |
|---|---|---|
| 0 | kan@prontomarketing.com | Kan Ouivirach |
| 1 | someone@datascience.th | Some Data Scientist |
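
The two files above differ in shape: a single JSON object maps naturally to a `Series` (hence `typ='series'`), while a list of objects maps to a `DataFrame` with one row per object. The sketch below shows the same distinction without touching the file system, wrapping JSON text in `io.StringIO`. The names and addresses are placeholders, not the contents of the files above.

```python
import io
import pandas as pd

# Placeholder JSON text: one object, and a list of two objects.
record = '{"name": "Ada", "email": "ada@example.com"}'
records = (
    '[{"name": "Ada", "email": "ada@example.com"},'
    ' {"name": "Bob", "email": "bob@example.com"}]'
)

# A single object becomes a Series indexed by its keys.
as_series = pd.read_json(io.StringIO(record), typ="series")

# A list of objects becomes a DataFrame with one row per object.
as_frame = pd.read_json(io.StringIO(records))

print(as_series["name"])  # Ada
print(as_frame.shape)     # (2, 2)
```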


### Writing Data to File

In [65]:
adult.to_json('adult.json')

In [66]:
!ls

adult.csv                       reviews_Digital_Music_5.csv
adult.json                      reviews_Digital_Music_5.json.gz
exercise2.csv                   try_df.json
pandas-01.ipynb                 try_series.json
pandas-02.ipynb


In [67]:
adult = pd.read_json('adult.json')

In [68]:
adult.head(3)

| | Money Per Year | Work Class | age | capital-gain | capital-loss | education | education-num | fnlwgt | hours-per-week | marital-status | native-country | occupation | race | relationship | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <=50K | State-gov | 39 | 2174 | 0 | Bachelors | 13 | 77516 | 40 | Never-married | United-States | Adm-clerical | White | Not-in-family | Male |
| 1 | <=50K | Self-emp-not-inc | 50 | 0 | 0 | Bachelors | 13 | 83311 | 13 | Married-civ-spouse | United-States | Exec-managerial | White | Husband | Male |
| 10 | >50K | Private | 37 | 0 | 0 | Some-college | 10 | 280464 | 80 | Married-civ-spouse | United-States | Exec-managerial | Black | Husband | Male |
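
In the output above the rows come back as 0, 1, 10 and the columns are in alphabetical order, which suggests the JSON round trip did not preserve the original ordering in this environment. Whether that happens likely depends on the pandas version. A defensive sketch, using a small made-up frame rather than the full dataset, is to sort the index and reselect the columns after reading:

```python
import io
import pandas as pd

# Small stand-in frame; the values are illustrative only.
frame = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

# to_json() with no path returns the JSON text instead of writing a file.
json_text = frame.to_json()

restored = pd.read_json(io.StringIO(json_text))

# Re-impose the original row and column order in case either was reordered.
restored = restored.sort_index()[list(frame.columns)]

print(list(restored.columns))    # ['age', 'sex']
print(restored.index.tolist())   # [0, 1, 2]
print(restored["age"].tolist())  # [39, 50, 38]
```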


In [None]:
adult.to_csv('adult.csv')

In [None]:
!ls
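
One detail worth knowing when writing CSV: `to_csv` also writes the row index by default, and reading that file back produces an extra `Unnamed: 0` column. The sketch below compares the default with `index=False` on a small made-up frame, writing to a string instead of a file so it leaves nothing on disk.

```python
import io
import pandas as pd

# Small stand-in frame; the values are illustrative only.
frame = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Female"]})

# With no path, to_csv() returns the CSV text.
with_index = frame.to_csv()                # index written as a first, unnamed column
without_index = frame.to_csv(index=False)  # data columns only

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))     # ['Unnamed: 0', 'age', 'sex']
print(list(reread_without.columns))  # ['age', 'sex']
```

If the index carries no meaning, as with the default 0, 1, 2, ... labels here, `index=False` keeps the file closer to the original data.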