## Data Wrangling

    Gather
    Assess
    Clean

### A Definition and An Analogy
#### A Definition
    Wrangling is a weird word. Let’s check the definition. This is exactly what I did when I first heard the term and was perplexed just as you may be right now.

So wrangling means to round up, herd, or take charge of livestock, like horses or sheep. Let's focus in on the sheep example.

    A shepherd's main goals are to get their sheep to their pastures to let them graze, guide them to market to shear them, and put them in the barn to sleep. Before any of that, though, the sheep must be rounded up into a nice, organized group. The consequences if they're not? These tasks take longer. If the sheep are scattered, some could run off and get lost. A wolf could even sneak into the flock and feast on a few of them.

#### An Analogy
    The same idea of organizing before acting is true for those who are shepherds of data. We need to wrangle our data for good outcomes, otherwise there could be consequences. If we analyze, visualize, or model our data before we wrangle it, we risk making mistakes, missing out on cool insights, and wasting time. So best practices say wrangle. Always.



https://en.wikipedia.org/wiki/Click-through_rate

https://charliepark.org/slopegraphs/


Scrape data from:

https://www.kaggle.com/udacity/armenian-online-job-postings

## Gather: Download
    The dataset used in this lesson is hosted on this Kaggle Datasets page: Armenian Online Job Postings. Some context on this dataset, from the description section of that page:

    The online job market is a good indicator of overall demand for labor in an economy. This dataset consists of 19,000 job postings from 2004 to 2015 posted on CareerCenter, an Armenian human resource portal.

    Since postings are text documents and tend to have similar structures, text mining can be used to extract features like posting date, job title, company name, job description, salary, and more. Postings that had no structure or were not job-related were removed. The data was originally scraped from a Yahoo! mailing group.

#### Best Practice: Downloading Files Programmatically
    Files from the internet can be downloaded manually by clicking the download button (or sometimes by right-clicking on a link and clicking "Save file as"). But best practice is actually to download files programmatically, i.e., with code, for two reasons: scalability and reproducibility.

#### Scalability: 
    Imagine you had a thousand files to download on a thousand different web pages, instead of just one. It'd take an eternity to point and click a thousand times. You can do the same with a few lines of code.

#### Reproducibility: 
    Someone, whether it's you or another person, is likely going to want to run your analysis later, so make downloading the dataset or datasets as easy on that person as possible. Reproducibility is also one of the main principles of the scientific method (https://en.wikipedia.org/wiki/Scientific_method#Documentation_and_replication). You want to be able to prove to people that your analysis, visualization, etc. is legitimate. People need to know that, given your data, your computational environment, your code, etc., they can reproduce your results! Plus, the dataset or the web page it lives on may change, so if you include the date you downloaded the dataset, you give these future onlookers a chance to access archived copies of the dataset or at least understand why their results are different.
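
As a rough sketch of what "downloading with code" can look like, the helper below uses only the Python standard library. The function name `download_file`, the `data` folder, the `.downloaded.txt` sidecar file, and the example URLs are all assumptions made for illustration. Kaggle itself normally requires a login, so this is not a drop-in way to fetch the dataset above.

```python
import shutil
import urllib.request
from datetime import date
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, dest_dir="data"):
    """Download `url` into `dest_dir` and return the path of the saved file."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last segment of the URL path.
    filename = Path(urlparse(url).path).name
    target = dest_dir / filename

    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    # Reproducibility: record the source URL and the download date next to the file.
    note = dest_dir / (filename + ".downloaded.txt")
    note.write_text(f"{url}\n{date.today().isoformat()}\n")
    return target


# Scalability: the same function handles one URL or a thousand.
# The URLs below are placeholders, not real datasets.
# urls = ["https://example.com/file1.csv", "https://example.com/file2.csv"]
# paths = [download_file(u) for u in urls]
```

Writing the date to a small text file is just one way to keep the "when did I download this?" information with the data. A comment in the notebook would serve the same purpose.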
    

#### Unzip Files Using Python
https://docs.python.org/3/library/zipfile.html#zipfile-objects


```python
import zipfile

# Extract every file in the archive into the current working directory.
with zipfile.ZipFile("data/armenian-online-job-postings.zip") as fp:
    fp.extractall()
```
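
A slightly more defensive variant, offered as a sketch rather than the lesson's own code: it lists the archive's contents before extracting and writes them to an explicit folder instead of the current working directory. The helper name `extract_zip` and the destination folder are assumptions.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()     # inspect the contents before extracting
        archive.extractall(dest_dir)   # extract into the chosen folder
    return names


# Example call, assuming the archive from this lesson sits in ./data:
# extract_zip("data/armenian-online-job-postings.zip", "data")
```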

### What is a zip file?
https://www.lifewire.com/zip-file-2622675

#### What is a context manager?
https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
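
To make the idea concrete, here is a small comparison, assuming a throwaway text file created in a temporary folder. The `with` statement used for `zipfile.ZipFile` works the same way: the resource is closed when the block ends.

```python
import os
import tempfile

# Scratch file in a temporary folder so the example touches no real data.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Without a context manager: cleanup has to be written by hand.
f = open(path, "w")
try:
    f.write("gather, assess, clean\n")
finally:
    f.close()

# With a context manager: the file is closed when the block exits,
# even if the body raises an exception.
with open(path) as g:
    text = g.read()

print(g.closed)  # True
```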

#### Quality

Low quality data is commonly referred to as dirty data. Dirty data has issues with its content.

Imagine you had a table with two columns: Name and Height, like below:

A table with Name and Height headers

<img src="data1.png" height=200 width=200>

### Common data quality issues include:

* missing data, like the missing height value for Juan.
* invalid data, like a cell having an impossible value, e.g., the negative height value for Kwasi. Having "inches" and "centimetres" in the height entries is technically invalid as well, since those strings force the datatype for height to be a string. The datatype for height should be integer or float.
* inaccurate data, like Jane actually being 58 inches tall, not 55 inches tall.
* inconsistent data, like using different units for height (inches and centimetres).
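
Three of these four issues can be checked with a few lines of pandas. The sketch below uses a made-up stand-in for the Name and Height table (the actual values in the image are not reproduced here, and the fourth row is invented). Inaccurate data, such as Jane's true height, cannot be detected from the table alone and needs an outside source of truth.

```python
import pandas as pd

# Hypothetical stand-in for the Name/Height table; values are illustrative only.
heights = pd.DataFrame({
    "Name": ["Juan", "Kwasi", "Jane", "Amara"],
    "Height": [None, "-64 inches", "55 inches", "160 centimetres"],
})

# Missing data: rows with no height recorded.
missing = heights[heights["Height"].isna()]

# Split each entry into a numeric value and a unit.
parts = heights["Height"].str.extract(
    r"^(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)$"
)
parts["value"] = pd.to_numeric(parts["value"])

# Invalid data: impossible (non-positive) heights.
invalid = heights[parts["value"] <= 0]

# Inconsistent data: more than one unit appears in the same column.
units = parts["unit"].dropna().unique().tolist()

print(missing["Name"].tolist())  # ['Juan']
print(invalid["Name"].tolist())  # ['Kwasi']
print(units)                     # ['inches', 'centimetres']
```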


    Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. Unfortunately, that’s a bit of an evasive definition but it gets to something important: there are no hard and fast rules for data quality. One dataset may be high enough quality for one application but not for another.

#### Tidiness

    Untidy data is commonly referred to as "messy" data. Messy data has issues with its structure.

Tidy data is a relatively new concept coined by statistician, professor, and all-round data expert <a href="http://hadley.nz/">Hadley Wickham</a>. I’m going to take a quote from his excellent <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">paper</a> on the subject:

    It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis.

...

A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types. In tidy data:

* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.


<img src="data2.gif">

### Tidy Data in Python
http://www.jeannicholashould.com/tidy-data-in-python.html

### Assessment

#### Visual Assessment

    Visual assessment is simple. Open your data in your favorite software application (Google Sheets, Excel, a text editor, etc.) and scroll through it, looking for quality and tidiness issues.

<img src="data3.png" height=400 width=400>

#### Programmatic Assessment

    Programmatic assessment tends to be more efficient than visual assessment. One simple example of a programmatic assessment is pandas' info method, which gives us the basic info of our DataFrame, like the number of entries, the number of columns, the type of each column, the non-null count of each column (which reveals missing values), and more.

<img src="data4.png" height=400 width=400>
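
A minimal sketch of calling `info`, assuming a tiny made-up job postings table (the column names and values are not from the real dataset). `info` prints its report, so the example also captures the text through the `buf` parameter and counts missing values with `isna().sum()`.

```python
import io

import pandas as pd

# Hypothetical miniature of a job postings table.
postings = pd.DataFrame({
    "title": ["Accountant", "Software Developer", "Lawyer"],
    "company": ["Company A", None, "Company C"],
    "year": [2004, 2010, 2015],
})

# Prints the index range, column names, non-null counts, and dtypes.
postings.info()

# The same report can be captured as text, e.g., to save alongside the analysis.
buffer = io.StringIO()
postings.info(buf=buffer)
report = buffer.getvalue()

# Missing values per column, as numbers rather than a printed report.
missing_per_column = postings.isna().sum()
```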



Another example is using pandas' plotting capabilities through the plot method, though simple visualizations are more common in exploratory data analysis (we'll discuss this later in this lesson) than in data wrangling.

These types of assessments are handy for gauging your data’s structure and also for quickly spotting things that we’ll need to clean.


<img src="data5.png" >

### Escape character
https://en.wikipedia.org/wiki/Escape_character

#### Explore Common Programmatic Assessments
    Now it's time to explore programmatic assessments for yourself! Again, this is where we use code to help detect problems in our data that aren’t as easily detectable with the human eye.

The data wrangling template is displayed in the Jupyter Notebook below with empty cells for four common programmatic assessment methods in pandas (documentation pages linked below):

* head
* tail
* info
* value_counts

Execute these assessments as per the instructions in those cells. In the following quizzes, you'll be asked to replicate these statements.

For these quizzes and all quizzes going forward, don’t go into them just trying to get them right. Exploring is a key part of learning. Get the code right, then I encourage you to explore the documentation, try various parameters, try new things and see where things break. Error messages are your friend, because you can learn from them.
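
For reference, here is a sketch of the four methods on a small made-up DataFrame. The column names and values are invented and are not the real job postings data.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "title": ["Accountant", "Lawyer", "Accountant", "Driver", "Accountant", "Lawyer"],
    "year": [2004, 2006, 2008, 2010, 2012, 2015],
})

first_rows = df.head(2)   # first two rows (the default is five)
last_rows = df.tail(2)    # last two rows (the default is five)
df.info()                 # prints column names, non-null counts, and dtypes

# Frequency of each distinct value, most common first.
title_counts = df["title"].value_counts()
print(title_counts)
```

`value_counts` is handy for spotting inconsistent spellings of the same category, since each variant shows up as its own row in the result.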

https://simplystatistics.org/2016/02/17/non-tidy-data/

### Clean: Intro

#### Improving Quality and Tidiness

    Cleaning means acting on the assessments we made to improve quality and tidiness.


#### Improving Quality

Improving quality doesn’t mean changing the data to make it say something different: that's data fraud.

Consider the animals DataFrame, which has headers for name, body weight (in kilograms), and brain weight (in grams). The last five rows of this DataFrame are displayed below:

<img src='data6.png'>

Examples of improving quality include:

* Correcting when inaccurate, like correcting the mouse's body weight to 0.023 kg instead of 230 kg
* Removing when irrelevant, like removing the row with "Apple" since an apple is a fruit and not an animal
* Replacing when missing, like filling in the missing value for brain weight for Brachiosaurus
* Combining, like concatenating the missing rows in the more_animals DataFrame displayed below

<img src='data7.png'>
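
The four kinds of fixes listed above might look like this in pandas. This is a sketch on stand-in data: the column names, the rows in `more_animals`, and the brain weight filled in for Brachiosaurus are illustrative assumptions, not values read from the images.

```python
import pandas as pd

# Stand-in for the animals DataFrame; values are illustrative.
animals = pd.DataFrame({
    "name": ["Mouse", "Apple", "Brachiosaurus", "Pig"],
    "body_weight_kg": [230.0, 0.1, 87000.0, 192.0],
    "brain_weight_g": [0.4, None, None, 180.0],
})

# Stand-in for the more_animals DataFrame (hypothetical rows).
more_animals = pd.DataFrame({
    "name": ["Gorilla", "Human"],
    "body_weight_kg": [207.0, 62.0],
    "brain_weight_g": [406.0, 1320.0],
})

# Work on a copy so the original data stays untouched.
clean = animals.copy()

# Correct when inaccurate: the mouse weighs 0.023 kg, not 230 kg.
clean.loc[clean["name"] == "Mouse", "body_weight_kg"] = 0.023

# Replace when missing: fill in the brain weight (illustrative value).
clean.loc[clean["name"] == "Brachiosaurus", "brain_weight_g"] = 154.5

# Remove when irrelevant: an apple is not an animal.
clean = clean[clean["name"] != "Apple"]

# Combine: append the rows from more_animals.
clean = pd.concat([clean, more_animals], ignore_index=True)
```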

### Improving Tidiness

    Improving tidiness means transforming the dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table. There are special functions in pandas that help us do that.
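
One such function is `melt`. The sketch below assumes a made-up "wide" table in which the variable year is spread across column headers; the column names and counts are invented for illustration.

```python
import pandas as pd

# Hypothetical untidy table: one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2014": [10, 20],
    "2015": [12, 24],
})

# Turn the year columns into rows: one row per (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="job_postings")

# The former column headers are strings; store year as an integer.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After melting, each variable (country, year, job_postings) is a column and each observation is a row.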
    
### Programmatic Data Cleaning Process

The programmatic data cleaning process:

* Define
* Code
* Test
    
<b>Defining</b> means defining a data cleaning plan in writing, where we turn our assessments into defined cleaning tasks. This plan will also serve as an instruction list so others (or us in the future) can look at our work and reproduce it.
    
<b>Coding</b> means translating these definitions to code and executing that code.

<b>Testing</b> means testing our dataset, often using code, to make sure our cleaning operations worked.

#### Defining, then Coding and Testing Immediately
For pedagogical purposes in this lesson, we will be performing the define, code, and test steps of cleaning data programmatically in order. In other words, we write all of the definitions, then convert all of the definitions to code, then test all of the cleaning operations.

In reality, it is often more practical to define a cleaning operation, then immediately code and test it. The data wrangling template still applies here, except you'll have multiple Define, Code, and Test subheadings, with third level headers (###) denoting each issue, as displayed below.
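
As a sketch of one define, code, test cycle for a single issue, consider a height column that mixes units. The data, the column names, and the `to_cm` lookup are assumptions made for illustration. In a notebook, the Define step would normally be a Markdown cell rather than a code comment.

```python
import pandas as pd

# Hypothetical data with inconsistent units.
heights_raw = pd.DataFrame({
    "name": ["Jane", "Amara"],
    "height": ["58 inches", "160 centimetres"],
})
heights_clean = heights_raw.copy()  # keep the original untouched

# Define
# Convert every height to centimetres, store it as a float in `height_cm`,
# and drop the original string column.

# Code
parts = heights_clean["height"].str.extract(
    r"^(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)$"
)
to_cm = {"inches": 2.54, "centimetres": 1.0}  # multiplier per unit
heights_clean["height_cm"] = parts["value"].astype(float) * parts["unit"].map(to_cm)
heights_clean = heights_clean.drop(columns="height")

# Test
assert heights_clean["height_cm"].dtype == float
assert heights_clean["height_cm"].notna().all()
```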

### Indexing, Slicing and Subsetting DataFrames in Python
https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/

<a href="https://wiki.python.org/moin/ForLoop"> python loops </a>
<a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html">Pandas: Options and Settings </a>

## Wrangling vs. EDA

    Here is one definition of EDA: an analysis approach that focuses on identifying general patterns in the data, and identifying outliers and features of the data that might not have been anticipated.

So where does data wrangling end and EDA start?

    Data wrangling is about gathering the right pieces of data, assessing your data's quality and structure, then modifying your data to make it clean. But the assessments you make and convert to cleaning operations won't make your analysis, visualization, or model better. The goal is just to make them possible, i.e., functional.

    EDA is about exploring your data to later augment it to maximize the potential of your analyses, visualizations, and models. When exploring, simple visualizations are often used to summarize your data's main characteristics. From there you can do things like create new and more descriptive features from existing data, also known as feature engineering, or detect and remove outliers so your model's fit is better.
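
To illustrate the two EDA activities named above, here is a sketch on made-up salary data. The 1.5 × IQR rule used to flag outliers is one common convention chosen for this example and is not prescribed by the lesson. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical job postings with a salary column.
postings = pd.DataFrame({
    "title": ["Accountant", "Software Developer", "Lawyer", "Driver",
              "Chief Financial Officer", "Teacher", "Nurse", "Sales Manager"],
    "monthly_salary": [100, 200, 200, 300, 300, 300, 400, 5000],
})

# Summarize the main characteristics of the column.
summary = postings["monthly_salary"].describe()

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1 = postings["monthly_salary"].quantile(0.25)
q3 = postings["monthly_salary"].quantile(0.75)
iqr = q3 - q1
in_range = postings["monthly_salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = postings[~in_range]

# Feature engineering: derive a new feature from an existing column.
postings["title_word_count"] = postings["title"].str.split().str.len()

print(outliers["monthly_salary"].tolist())  # [5000]
```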

In practice, wrangling and EDA can and often do occur together, but we're going to separate them for teaching purposes.

### ETL

You may also have heard of the extract-transform-load process, also known as ETL. ETL differs from data wrangling in three main ways:

* The users are different
* The data is different
* The use cases are different

This article (Data Wrangling Versus ETL: What’s the Difference?) by Wei Zhang explains these three differences well.


https://en.wikipedia.org/wiki/Feature_engineering

https://tdwi.org/articles/2017/02/10/data-wrangling-and-etl-differences.aspx

https://en.wikipedia.org/wiki/Extract,_transform,_load


## Data Wrangling Summary



#### Gather

* Depending on the source of your data, and what format it's in, the steps in gathering data vary.
* High-level gathering process: obtaining data (downloading a file from the internet, scraping a web page, querying an API, etc.) and importing that data into your programming environment (e.g., Jupyter Notebook).

#### Assess

Assess data for:

* Quality: issues with content. Low quality data is also known as dirty data.
* Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    * Each variable forms a column.
    * Each observation forms a row.
    * Each type of observational unit forms a table.

Types of assessment:

* Visual assessment: scrolling through the data in your preferred software application (Google Sheets, Excel, a text editor, etc.).
* Programmatic assessment: using code to view specific portions and summaries of the data (pandas' head, tail, and info methods, for example).


#### Clean

Types of cleaning:

* Manual (not recommended unless the issues are single occurrences)
* Programmatic

The programmatic data cleaning process:
* Define: convert your assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
* Code: convert those definitions to code and run that code.
* Test: test your dataset, visually or with code, to make sure your cleaning operations worked.

Always make copies of the original pieces of data before cleaning!

#### Reassess and Iterate
* After cleaning, always reassess and iterate on any of the data wrangling steps if necessary.

#### Store (Optional)
* Store data, in a file or database for example, if you need to use it in the future.
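
A minimal sketch of the store step, assuming a CSV file in a temporary folder; the file name and data are made up. A database would work as well. The file is read back to check that the round trip preserves the table.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data.
clean = pd.DataFrame({
    "name": ["Mouse", "Pig"],
    "body_weight_kg": [0.023, 192.0],
})

# Write to CSV without the index column, then read it back.
out_path = Path(tempfile.mkdtemp()) / "animals_clean.csv"
clean.to_csv(out_path, index=False)
restored = pd.read_csv(out_path)
```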