Data wrangling is the art of handling missing or ill-formatted data and converting it into a format that lends itself more easily to analysis.

Before that, we have to get the data we need from files, databases, or APIs.

On average, over 50% of the time in an analysis is spent combing through the data.


## 1. Acquiring Data
- Acquiring data often isn't fancy.
- Find stuff on the Internet!
- A lot of data is stored in text files and on government websites.

### Common Data Formats
1. CSV (Comma-Separated Values)

        Header1, Header2, ...
        Content1, Content2, ...

2. XML

        <DocumentElement>
            <Table>
                <Head1>Content</Head1>
                <Head2>Content2</Head2>
                <Head3/>
            </Table>
        </DocumentElement>

3. JSON

        {
            "Head 1": 1,
            "Head 2": "Content2",
            "Head 3": " "
        }
    
 
XML and JSON support nested structures while CSV does not.
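
To make the nesting point concrete, here is a minimal sketch using only the standard library. The record and its field names (`name`, `address`, `city`, `zip`) are made up for illustration, and joining nested keys with dots is just one possible convention for squeezing a nested record into a flat CSV row.

```python
import csv
import io
import json

# Hypothetical nested record; the field names are invented for this example.
record = json.loads('{"name": "Ann", "address": {"city": "Pune", "zip": "411001"}}')


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining nested keys with dots."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{full_key}."))
        else:
            flat[full_key] = value
    return flat


# JSON keeps "address" as a nested object; CSV needs one flat row.
flat_record = flatten(record)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_record))
writer.writeheader()
writer.writerow(flat_record)
csv_text = buffer.getvalue()
print(csv_text)
```

The nested `address` object has to be spread over two columns (`address.city`, `address.zip`) before it fits into CSV.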

### 1) CSV Data

In [11]:
import pandas as pd

# Importing CSV file
csv_data = pd.read_csv("inconsistency.csv")
csv_data

Unnamed: 0,age,agegroup,height,status,yearsmarried
0,21,adult,6.0,single,1
1,2,child,3.0,married,0
2,18,adult,5.7,married,20
3,221,elderly,5.0,widowed,2
4,34,child,7.0,married,3


In [6]:
print(csv_data['agegroup'])

csv_data['new_col'] = csv_data['height'] * csv_data['yearsmarried']
print(csv_data)

0      adult
1      child
2      adult
3    elderly
4      child
Name: agegroup, dtype: object
   age agegroup  height   status  yearsmarried  new_col
0   21    adult     6.0   single             1      6.0
1    2    child     3.0  married             0      0.0
2   18    adult     5.7  married            20    114.0
3  221  elderly     5.0  widowed             2     10.0
4   34    child     7.0  married             3     21.0


In [15]:
# Exporting DataFrame to CSV
csv_data.to_csv("NewCSVfile.csv")

In [16]:
pd.read_csv("NewCSVfile.csv")

Unnamed: 0.1,Unnamed: 0,age,agegroup,height,status,yearsmarried
0,0,21,adult,6.0,single,1
1,1,2,child,3.0,married,0
2,2,18,adult,5.7,married,20
3,3,221,elderly,5.0,widowed,2
4,4,34,child,7.0,married,3
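
The extra `Unnamed: 0.1` column above appears because `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then reads it back as ordinary data. A minimal sketch of one way to avoid this, using a small made-up DataFrame and in-memory buffers instead of real files:

```python
import io

import pandas as pd

# Small illustrative DataFrame (a made-up subset, not the real file).
df = pd.DataFrame({"age": [21, 2], "status": ["single", "married"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False leaves the index out, so the round trip keeps the same columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())
```

Another option, if the file has already been written with the index, is to read it back with `pd.read_csv(path, index_col=0)`.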


In [18]:
# Replacing spaces with underscores, setting all chars to lowercase

csv_data.rename(columns = lambda x : x.replace(' ', '_').lower(), inplace=True)
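
As a quick check of what that lambda does, here is the same transformation applied to a plain list of header names. The headers are hypothetical examples; the function could equally be passed to `rename(columns=...)` in place of the lambda.

```python
def normalize_column(name):
    """Replace spaces with underscores and lowercase the name."""
    return name.replace(" ", "_").lower()


# Hypothetical raw headers, as they might appear in a downloaded CSV.
raw_columns = ["Enrolment Agency", "Aadhaar generated", "State"]
clean_columns = [normalize_column(column) for column in raw_columns]
print(clean_columns)
```

Names without spaces are easier to use later, for example as column names in SQL queries.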

### 2) Relational Database

Why use a relational database?

It is straightforward to extract aggregated data with complex filters.

A database scales well.

It ensures all data is consistently formatted.

### Schema

A schema is a blueprint that tells the database how we plan to store our data: which tables exist, which columns they have, and what type each column holds.
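
As a rough illustration, a schema can be written as a `CREATE TABLE` statement. The sketch below uses the standard-library `sqlite3` module with an in-memory database. The column names are the ones used by the queries in this section, but the column types are assumptions, not the actual schema of the Aadhaar dataset.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Assumed schema: column names come from the queries in these notes,
# the types are guesses made for illustration only.
conn.execute("""
    CREATE TABLE aadhaar_data (
        registrar          TEXT,
        enrolment_agency   TEXT,
        state              TEXT,
        district           TEXT,
        subdistrict        TEXT,
        gender             TEXT,
        age                INTEGER,
        aadhaar_generated  INTEGER
    )
""")

# PRAGMA table_info returns one row per column: (cid, name, type, ...).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(aadhaar_data)")]
for name, column_type in columns:
    print(name, column_type)
```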

### Simple Queries (SQL-like queries)

SELECT * FROM aadhaar_data
LIMIT 20;

SELECT district, subdistrict FROM aadhaar_data;

In [None]:
# From pandas, using SQL-like queries
import pandas
import pandasql

def select_first_50(filename):
    # Load the CSV (local path or URL) into a DataFrame
    aadhaar_data = pandas.read_csv(filename)
    # Normalize column names for SQL: spaces become underscores, all lowercase
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    query = """SELECT registrar, enrolment_agency FROM aadhaar_data LIMIT 50;"""

    # sqldf resolves the table name aadhaar_data from the local variables
    aadhaar_solution = pandasql.sqldf(query.lower(), locals())
    return aadhaar_solution

select_first_50("https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/aadhaar_data.csv")


## pandasql is not installed here, so there is no output, but the point is that a query like the one above can also be run from Python!

### Complex Queries

SELECT * FROM aadhaar_data WHERE state = 'Gujarat';

SELECT district, SUM(aadhaar_generated) FROM aadhaar_data GROUP BY district;

SELECT district, subdistrict, SUM(aadhaar_generated) FROM aadhaar_data WHERE age > 60 GROUP BY district, subdistrict;

In [None]:
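# From pandas, using SQL-like queries
import pandas
import pandasql

def select_first_50(filename):
    # Load the CSV (local path or URL) into a DataFrame
    aadhaar_data = pandas.read_csv(filename)
    # Normalize column names for SQL: spaces become underscores, all lowercase
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Total Aadhaar generated per gender and district, for people older than 50
    query = """SELECT gender, district, SUM(aadhaar_generated)
               FROM aadhaar_data WHERE age > 50 GROUP BY gender, district;"""

    # sqldf resolves the table name aadhaar_data from the local variables
    aadhaar_solution = pandasql.sqldf(query.lower(), locals())
    return aadhaar_solution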

select_first_50("https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/aadhaar_data.csv")


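## pandasql is not installed here, so there is no output, but the point is that a query like the one above can also be run from Python!
## The columns used in GROUP BY should also appear in the SELECT list, so that each aggregated row is labeled with its group.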
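
When pandasql is not available, the same kind of aggregation can be tried with the standard-library `sqlite3` module. This is a minimal sketch on a handful of made-up rows, not the real Aadhaar data. The table is reduced to the four columns the query needs, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aadhaar_data ("
    "gender TEXT, district TEXT, age INTEGER, aadhaar_generated INTEGER)"
)

# Made-up rows for illustration only.
rows = [
    ("F", "Surat", 62, 3),
    ("F", "Surat", 55, 2),
    ("M", "Surat", 70, 4),
    ("M", "Rajkot", 45, 9),  # dropped by the age > 50 filter
    ("F", "Rajkot", 51, 1),
]
conn.executemany("INSERT INTO aadhaar_data VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT gender, district, SUM(aadhaar_generated)
    FROM aadhaar_data
    WHERE age > 50
    GROUP BY gender, district
    ORDER BY gender, district
"""
result = conn.execute(query).fetchall()
print(result)  # [('F', 'Rajkot', 1), ('F', 'Surat', 5), ('M', 'Surat', 4)]
```

Selecting `gender` and `district` alongside the sum is what makes each output row identifiable.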

### 3) APIs

API (Application Programming Interface)

- REST API (Representational State Transfer)

Interacting with an API

http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=[API_KEY]&artist=Rihanna&album=Loud&format=json

In the URL above, everything after the `?` defines the API parameters.
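
The parameter part of such a URL can be built and taken apart with the standard library. This is a small sketch using the parameters from the URL above. `YOUR_API_KEY` is a placeholder, not a real key, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://ws.audioscrobbler.com/2.0/"

# Query parameters from the example URL; the API key is a placeholder.
params = {
    "method": "album.getinfo",
    "api_key": "YOUR_API_KEY",
    "artist": "Rihanna",
    "album": "Loud",
    "format": "json",
}

# urlencode joins the pairs as key=value separated by "&".
url = f"{base_url}?{urlencode(params)}"
print(url)

# Going the other way: split the URL and parse the part after "?".
parsed = parse_qs(urlsplit(url).query)
print(parsed["artist"])  # ['Rihanna']
```

With `requests`, the same dictionary could be passed as `requests.get(base_url, params=params)` instead of building the string by hand.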

In [None]:
import json
import requests

url = "url"  # placeholder: put the full API request URL here
data = requests.get(url).text
data = json.loads(data)   # convert the JSON text into a Python dict
print(data['topartists']['artist'][0]['name'])
data['artist']

## 2. Cleansing Data

* Does the data make sense?

* Is there a problem?

* Does the data look like I expect it to?

In [None]:
import pandas
baseball = pandas.read_csv("pathname")
baseball.describe()

# pandas' describe() gives a summary of the whole dataset at a glance.
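
The three questions above can also be turned into explicit checks. The sketch below uses the five rows of `inconsistency.csv` shown earlier, written out as plain dictionaries. The thresholds (a maximum plausible age of 120, adulthood at 18) and the three rules are assumptions chosen for illustration, not rules taken from the dataset.

```python
# Rows copied from the inconsistency.csv output shown earlier.
rows = [
    {"age": 21, "agegroup": "adult", "height": 6.0, "status": "single", "yearsmarried": 1},
    {"age": 2, "agegroup": "child", "height": 3.0, "status": "married", "yearsmarried": 0},
    {"age": 18, "agegroup": "adult", "height": 5.7, "status": "married", "yearsmarried": 20},
    {"age": 221, "agegroup": "elderly", "height": 5.0, "status": "widowed", "yearsmarried": 2},
    {"age": 34, "agegroup": "child", "height": 7.0, "status": "married", "yearsmarried": 3},
]

MAX_PLAUSIBLE_AGE = 120  # assumption: nobody in the data is older than this
ADULT_AGE = 18           # assumption: "child" means younger than 18


def find_problems(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    if not 0 <= row["age"] <= MAX_PLAUSIBLE_AGE:
        problems.append("age out of range")
    if row["agegroup"] == "child" and row["age"] >= ADULT_AGE:
        problems.append("labelled child but age is adult")
    if row["yearsmarried"] > row["age"]:
        problems.append("married longer than alive")
    return problems


# Map row index -> problems, keeping only rows that have at least one.
flagged = {}
for index, row in enumerate(rows):
    problems = find_problems(row)
    if problems:
        flagged[index] = problems

for index, problems in flagged.items():
    print(index, problems)
```

Rows 2, 3, and 4 are flagged. Row 1 (a married two-year-old) passes these particular rules, which shows that the checks are only as good as the expectations encoded in them.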