# Data Engineering

**Data engineers** gather data from various sources, then process, combine, and manipulate it so that it can be easily accessed and used. This whole process, from gathering the data to getting it ready for use, can be automated by software programs called **data pipelines**.

As a data scientist, it's important to have some level of skill in data engineering regardless of the size of the company you work in.

In larger companies, although there are dedicated data engineers, data scientists still need to communicate with them about the data they need for a data science project. In smaller companies, data scientists may need to take on some of the responsibilities of a data engineer.

## Data Pipelines: ETL vs ELT
A data pipeline is a generic term for a process that moves data from one place to another. For example, it could move data from one server to another.

### ETL
An ETL pipeline is a specific, and very common, kind of data pipeline. ETL stands for Extract, Transform, Load. Imagine that you have a database containing web log data. Each entry contains the IP address of a user, a timestamp, and the link that the user clicked.

What if your company wanted to run an analysis of links clicked by city and by day? You would need another data set that maps IP address to a city, and you would also need to extract the day from the timestamp. With an ETL pipeline, you could run code once per day that would extract the previous day's log data, map IP address to city, aggregate link clicks by city, and then load these results into a new database. That way, a data analyst or scientist would have access to a table of log data by city and day. That is more convenient than always having to run the same complex data transformations on the raw web log data.
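
The daily job described above can be sketched with pandas and SQLite. This is a minimal illustration, not the lesson's implementation: the sample rows, the `ip_to_city` lookup table, and the `clicks_by_city_day` table name are all hypothetical.

```python
import sqlite3
import pandas as pd

# Extract: hypothetical raw web log rows (in practice, read from the log database)
logs = pd.DataFrame({
    "ip": ["1.1.1.1", "1.1.1.1", "2.2.2.2"],
    "timestamp": ["2024-01-01 09:00:00", "2024-01-01 17:30:00", "2024-01-01 12:00:00"],
    "link": ["/home", "/about", "/home"],
})
# Hypothetical lookup table that maps an IP address to a city
ip_to_city = pd.DataFrame({"ip": ["1.1.1.1", "2.2.2.2"], "city": ["Sydney", "Lima"]})

# Transform: extract the day, map IP to city, and count clicks per city, day, and link
logs["day"] = pd.to_datetime(logs["timestamp"]).dt.strftime("%Y-%m-%d")
merged = logs.merge(ip_to_city, on="ip", how="left")
clicks = merged.groupby(["city", "day", "link"]).size().reset_index(name="clicks")

# Load: write the aggregated table to a new database (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
clicks.to_sql("clicks_by_city_day", conn, index=False)
print(clicks)
```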

Before cloud computing, businesses stored their data on large, expensive, private servers. Running queries on large data sets, like raw web log data, could be expensive both economically and in terms of time. But data analysts might need to query a database multiple times even in the same day; hence, pre-aggregating the data with an ETL pipeline makes sense.

### ELT
ELT (Extract, Load, Transform) pipelines have gained traction since the advent of cloud computing. Cloud computing has lowered the cost of storing data and running queries on large, raw data sets. Many of these cloud services, like Amazon Redshift, Google BigQuery, or IBM Db2, can be queried using SQL or a SQL-like language. With these tools, the data gets extracted, then loaded directly, and finally transformed at the end of the pipeline.

However, ETL pipelines are still used even with these cloud tools. Oftentimes, it still makes sense to run ETL pipelines and store data in a more readable or intuitive format. This can help data analysts and scientists work more efficiently as well as help an organization become more data driven.

## World Bank Data
This lesson uses data from the World Bank. The data comes from two sources:

1. [World Bank Indicator Data](https://data.worldbank.org/indicator) - This data contains socio-economic indicators for countries around the world. A few example indicators include population, arable land, and central government debt.
2. [World Bank Project Data](https://datacatalog.worldbank.org/dataset/world-bank-projects-operations) - This data set contains information about World Bank project lending since 1947.

Both of these data sets are available in different formats, including CSV, JSON, and XML. You can download the CSV file directly, or you can use the World Bank APIs to extract data from the World Bank's servers. You'll be doing both in this lesson.

## Summary of the data file types you'll work with
### CSV files
CSV stands for comma-separated values. These types of files separate values with a comma, and each entry is on a separate line. Oftentimes, the first entry will contain variable names. Here is an example of what CSV data looks like. This is an abbreviated version of the first three lines in the World Bank projects data csv file.
```
id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Project Financing
```
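As a small sketch of how such a file can be read, the standard library's `csv.DictReader` treats the first line as the variable names. The sample rows are held in a string here so that the example is self-contained.

```python
import csv
import io

# The sample rows from above, kept in a string instead of a file
csv_text = """id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Project Financing
"""

# DictReader uses the first line as the keys for every following row
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["id"], rows[0]["regionname"])
```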
### JSON
JSON is a file format with key/value pairs. It looks like a Python dictionary. The exact same CSV file represented in JSON could look like this:
```json
[{"id":"P162228","regionname":"Other","countryname":"World;World","prodline":"RE","lendinginstr":"Investment Project Financing"},{"id":"P163962","regionname":"Africa","countryname":"Democratic Republic of the Congo;Democratic Republic of the Congo","prodline":"PE","lendinginstr":"Investment Project Financing"},{"id":"P167672","regionname":"South Asia","countryname":"People's Republic of Bangladesh;People's Republic of Bangladesh","prodline":"PE","lendinginstr":"Investment Project Financing"}]
```
Each row of the data is enclosed in curly braces {}. The variable names are the keys, and the variable values are the values.

There are other ways to organize JSON data, but the general rule is that JSON is organized into key/value pairs. For example, here is a different way to represent the same data using JSON:
```json
{"id":{"0":"P162228","1":"P163962","2":"P167672"},"regionname":{"0":"Other","1":"Africa","2":"South Asia"},"countryname":{"0":"World;World","1":"Democratic Republic of the Congo;Democratic Republic of the Congo","2":"People's Republic of Bangladesh;People's Republic of Bangladesh"},"prodline":{"0":"RE","1":"PE","2":"PE"},"lendinginstr":{"0":"Investment Project Financing","1":"Investment Project Financing","2":"Investment Project Financing"}}
```
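To illustrate that both layouts carry the same information, here is a minimal sketch using the standard `json` module. The two strings are shortened versions of the examples above (two rows, two variables).

```python
import json

# Row-oriented layout: a list with one object per row
records_json = '[{"id": "P162228", "regionname": "Other"}, {"id": "P163962", "regionname": "Africa"}]'
# Column-oriented layout: one object per variable, keyed by row index
columns_json = '{"id": {"0": "P162228", "1": "P163962"}, "regionname": {"0": "Other", "1": "Africa"}}'

records = json.loads(records_json)
columns = json.loads(columns_json)

# Rebuild the row-oriented records from the column-oriented layout
rebuilt = [
    {name: values[str(i)] for name, values in columns.items()}
    for i in range(len(columns["id"]))
]
print(rebuilt == records)
```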
### XML
Another data format is called XML (Extensible Markup Language). XML is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set. Here is what this same data would look like as XML.

```xml
<ENTRY>
  <ID>P162228</ID>
  <REGIONNAME>Other</REGIONNAME>
  <COUNTRYNAME>World;World</COUNTRYNAME>
  <PRODLINE>RE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P163962</ID>
  <REGIONNAME>Africa</REGIONNAME>
  <COUNTRYNAME>Democratic Republic of the Congo;Democratic Republic of the Congo</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P167672</ID>
  <REGIONNAME>South Asia</REGIONNAME>
  <COUNTRYNAME>People's Republic of Bangladesh;People's Republic of Bangladesh</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
```
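
A short sketch of reading this kind of data with the standard library's `xml.etree.ElementTree` follows. The parser expects a single root element, so the entries are wrapped in an `<ENTRIES>` tag, which is an assumption made for this example and not part of the sample above.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <ENTRIES> root tag is hypothetical
xml_text = """<ENTRIES>
<ENTRY><ID>P162228</ID><REGIONNAME>Other</REGIONNAME></ENTRY>
<ENTRY><ID>P163962</ID><REGIONNAME>Africa</REGIONNAME></ENTRY>
</ENTRIES>"""

root = ET.fromstring(xml_text)
# Turn each <ENTRY> into a dictionary of lowercase tag name -> text
entries = [
    {child.tag.lower(): child.text for child in entry}
    for entry in root.findall("ENTRY")
]
print(entries)
```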

XML is falling out of favor, especially because JSON tends to be easier to navigate; however, you still might come across XML data. The World Bank API, for example, can return either XML data or JSON data. From a data perspective, the process for handling HTML and XML data is essentially the same.

### SQL databases
SQL databases store data in tables using [primary and foreign keys](https://docs.microsoft.com/en-us/sql/relational-databases/tables/primary-and-foreign-key-constraints?view=sql-server-2017). In a SQL database, the same data would look like this:

| id | regionname | countryname | prodline | lendinginstr |
|---|---|---|---|---|
| P162228 | Other | World;World | RE | Investment Project Financing |
| P163962 | Africa | Democratic Republic of the Congo;Democratic Republic of the Congo | PE | Investment Project Financing |
| P167672 | South Asia | People's Republic of Bangladesh;People's Republic of Bangladesh | PE | Investment Project Financing |
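
As a rough sketch, the same table can be created and queried with Python's built-in `sqlite3` module. The table name `projects` and the choice of `id` as the primary key are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE projects (id TEXT PRIMARY KEY, regionname TEXT, "
    "countryname TEXT, prodline TEXT, lendinginstr TEXT)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?, ?)",
    [
        ("P162228", "Other", "World;World", "RE", "Investment Project Financing"),
        ("P163962", "Africa",
         "Democratic Republic of the Congo;Democratic Republic of the Congo",
         "PE", "Investment Project Financing"),
    ],
)

# Query the table with SQL
pe_ids = [row[0] for row in conn.execute("SELECT id FROM projects WHERE prodline = 'PE'")]
print(pe_ids)
```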

### Text Files
This course won't go into much detail about text data. There are other Udacity courses, namely on natural language processing, that go into the details of processing text for machine learning.

Text data present their own issues. Whereas CSV, JSON, XML, and SQL data are organized with a clear structure, text is more ambiguous. For example, the World Bank project data country names are written like this:
```
Democratic Republic of the Congo;Democratic Republic of the Congo
```
In the World Bank Indicator data sets, the Democratic Republic of the Congo is represented by the abbreviation "Congo, Dem. Rep." You'll have to clean these country names to join the data sets together.

### Extracting Data from the Web
In this lesson, you'll see how to extract data from the web using APIs (Application Programming Interfaces). APIs generally provide data in either JSON or XML format.

Companies and organizations provide APIs so that programmers can access data in an official, safe way. APIs allow you to download, and sometimes even upload or modify, data from a web server without giving you direct access.
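
As a hedged sketch, a request to the World Bank indicators API might be assembled like this. The country codes, the `SP.POP.TOTL` (total population) indicator, and the query parameters are example choices; check the World Bank API documentation for the options it currently supports. The network call itself is left commented out.

```python
from urllib.parse import urlencode

# Example endpoint: one indicator for two countries (values chosen for illustration)
base = "http://api.worldbank.org/v2/country/br;cn/indicator/SP.POP.TOTL"
params = {"format": "json", "per_page": 1000}
url = base + "?" + urlencode(params)
print(url)

# To actually download the data (requires network access and the requests package):
# import requests
# data = requests.get(url).json()
```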




## Transform

Transforming data means getting data ready for a machine learning algorithm or other data science projects. Transforming data can involve a wide range of processes, such as combining data from different sources, cleaning the data, and engineering new features. In short, you transform the original datasets to create a new dataset that is ready for use.

### Combining Data

Oftentimes you need to combine data from different datasets. The process of combining data can be complicated and very different from case to case.

Datasets can be from different sources and have different content so you need to check the data closely before combining them. Or they may be in different formats meaning you will need to transform data from one format to another first.

In the next part of the lesson, you will have an exercise for combining data from different sources using the Python Pandas package. If you are unfamiliar with Pandas, you can review the learning resources below to prepare yourself before heading to the exercise.
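
Two common ways of combining data in Pandas are stacking rows with `pd.concat` and joining columns with `merge`. The sketch below uses small made-up DataFrames; the column names and values are hypothetical.

```python
import pandas as pd

# Two hypothetical extracts with the same columns, e.g. from different files
population_a = pd.DataFrame({"country": ["Brazil", "China"], "population": [200, 1400]})
population_b = pd.DataFrame({"country": ["India"], "population": [1300]})
# A hypothetical dataset with different content about some of the same countries
gdp = pd.DataFrame({"country": ["Brazil", "India"], "gdp": [1.8, 2.6]})

# Stack rows from sources that share the same columns
population = pd.concat([population_a, population_b], ignore_index=True)
# Join columns from another source on a shared key; unmatched rows get NaN
combined = population.merge(gdp, on="country", how="left")
print(combined)
```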

### Cleaning Data

Dirty data refers to data that contains errors. Data errors can come from many sources, including:

* data entry mistakes
* duplicate data
* incomplete records
* inconsistencies between datasets

It's very important to audit data after obtaining it; otherwise, your data science project will not perform well or, even worse, will give you wrong results.

In the next section, you will have an exercise on data cleaning. Your job is to clean the data so the country names across different datasets are consistent.
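
One possible approach to this kind of cleaning, shown here only as a sketch, is to normalize each dataset's names with a small helper. The `name_fixes` mapping and both function names are hypothetical; a real mapping would need an entry for every mismatched country.

```python
# Hypothetical mapping from indicator-data names to full country names
name_fixes = {"Congo, Dem. Rep.": "Democratic Republic of the Congo"}

def clean_project_country(raw):
    """Keep the part before the semicolon in a project-data country name."""
    return raw.split(";")[0].strip()

def clean_indicator_country(raw):
    """Replace a known abbreviation with the full country name."""
    return name_fixes.get(raw, raw)

project_name = clean_project_country(
    "Democratic Republic of the Congo;Democratic Republic of the Congo"
)
indicator_name = clean_indicator_country("Congo, Dem. Rep.")
print(project_name == indicator_name)
```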

## Missing Data
In the video, I say that a machine learning algorithm won't work with missing values. This is essentially correct; however, there are a couple of situations where this isn't quite true. For example, if you had a categorical variable, you could keep the NULL value as one of the options.

For example, if theme_2 could have a value of agriculture, banking, or NULL, you might encode this variable as 0, 1, 2, where the value 2 stands in for the NULL value. You could do something similar for one-hot encoding, where the theme_2 variable becomes three true/false features: theme_2_agriculture, theme_2_banking, theme_2_NULL. You would have to check that this actually improves your model performance.
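
A minimal sketch of the one-hot variant with Pandas: passing `dummy_na=True` to `pd.get_dummies` adds an extra indicator column for missing values. The three-row DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"theme_2": ["agriculture", "banking", None]})

# dummy_na=True adds one more indicator column that flags missing values
encoded = pd.get_dummies(df["theme_2"], prefix="theme_2", dummy_na=True)
print(encoded)
```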

There are also implementations of some machine learning algorithms, such as [gradient boosting](https://xgboost.readthedocs.io/en/latest/) decision trees that can [handle missing values](https://github.com/dmlc/xgboost/issues/21).

### Missing Data - Delete
There are two ways to handle missing values:

* Delete data
* Fill missing values (also called imputation)

You can delete a feature with a large percentage of missing values because it's unlikely to contribute to your machine learning model unless somehow you can find the missing values from another source. If you think deleting a row consisting of a lot of missing values won't affect your result, you can also delete it.

However, deleting missing values is not the only option. In the next part, we will talk about another option: filling in missing values.
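
Deleting features and rows might look like the following sketch in Pandas. The 50% cutoff for dropping a feature is an arbitrary assumption for this example, not a rule from the lesson.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gdp": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "population": [10, 20, 30, 40],
})

# Drop features where more than half of the values are missing (assumed cutoff)
missing_share = df.isna().mean()
reduced = df.drop(columns=missing_share[missing_share > 0.5].index)

# Drop the rows that still contain a missing value
complete = reduced.dropna()
print(complete)
```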

### Missing Data - Impute
The process of filling in missing values is called `imputation`. Some of the imputation techniques are mean substitution, forward fill, and backward fill.

##### Mean Substitution
The mean substitution method fills in the missing values using the `column mean`. You can even group the data first and then fill in the missing values with the mean of each group. Alternatively, you can fill in the missing values with the `median` or `mode`.
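
Both variants can be sketched with Pandas `fillna`. The `region` and `gdp` columns below are made-up sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["Africa", "Africa", "Asia", "Asia"],
    "gdp": [1.0, np.nan, 10.0, np.nan],
})

# Fill with the overall column mean (5.5 for this sample)
overall = df["gdp"].fillna(df["gdp"].mean())

# Fill with the mean of each region instead
group_means = df.groupby("region")["gdp"].transform("mean")
by_group = df["gdp"].fillna(group_means)
print(by_group.tolist())
```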

##### Forward Fill or Backward Fill
Forward fill and backward fill work for `ordered` or `time series` data. In both methods, you use neighboring cells to fill in the missing value.

To use these methods, you should always make sure your data is `sorted` by a timestamp or in a meaningful way. With the forward fill, values are pushed forward `down` to replace any missing values. With backward fill, values move `up` to fill missing values.
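
A small sketch with Pandas `ffill` and `bfill`, using made-up yearly data that is deliberately out of order to show why sorting comes first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2003, 2001, 2002, 2004],
    "gdp": [np.nan, 1.0, np.nan, 4.0],
})

# Sort first so that neighboring rows are neighbors in time
df = df.sort_values("year")

forward = df["gdp"].ffill()   # the 2001 value is pushed down into 2002 and 2003
backward = df["gdp"].bfill()  # the 2004 value is pulled up into 2002 and 2003
print(forward.tolist(), backward.tolist())
```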

## Duplicate Data
Data duplication is obvious when the same row shows up more than once, and you can simply remove the duplicates using the Pandas `drop_duplicates` method. However, duplicate data can sometimes be trickier to find, which requires you to comb through the data to recognize and eliminate the duplicates.
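
The sketch below contrasts the two cases on made-up data. The normalization step (stripping whitespace and lowercasing) is just one hypothetical way to surface a less obvious duplicate.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["P1", "P1", "P2"],
    "countryname": ["Brazil", "Brazil", "brazil "],
})

# Exact duplicates: only the repeated first row is removed
exact = df.drop_duplicates()

# Trickier duplicates: normalize the text before comparing
df["country_clean"] = df["countryname"].str.strip().str.lower()
by_country = df.drop_duplicates(subset="country_clean")
print(len(exact), len(by_country))
```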

## Dummy Variables

### When to Remove a Feature
As mentioned in the video, if you have five categories, you only really need four features. For example, if the categories are "agriculture", "banking", "retail", "roads", and "government", then you only need four of those five categories for dummy variables. This topic is somewhat outside the scope of a data engineer.

In some cases, you don't necessarily need to remove one of the features. It will depend on your application. In regression models, which use linear combinations of features, removing a dummy variable is important. For a decision tree, removing one of the variables is not needed.
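
In Pandas, the `drop_first` argument of `pd.get_dummies` is one way to get four features from five categories. A minimal sketch with the categories from the example above:

```python
import pandas as pd

sectors = pd.Series(
    ["agriculture", "banking", "retail", "roads", "government"], name="sector"
)

all_dummies = pd.get_dummies(sectors)               # one column per category
reduced = pd.get_dummies(sectors, drop_first=True)  # one category is dropped
print(all_dummies.shape, reduced.shape)
```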


## Outliers

### Outliers - How to Find Them
**Outliers** are data points that are far away from the rest of the data. Outliers can be caused by errors such as data entry errors or processing mistakes. However, outliers can also be legitimate data.

There are some methods to detect outliers:

**Data visualization**
When working with one or two-dimensional data, you can visualize data to detect outliers. The data points far away from the rest of the data on the plot can potentially be outliers.

**Statistical methods**
Statistical properties of the data, like means, standard deviations, and quantiles, can be used to identify outliers. Examples include the z-score from the normal distribution and the Tukey method.

**Machine learning methods**
When it comes to high-dimensional data, you can use machine learning techniques such as PCA to reduce the data dimensions. Then you can use the methods discussed before to detect outliers.

Another way is to cluster the data then calculate the distance from each data point to the cluster centroid. A large distance may indicate an outlier.

Next, you will get some practice on identifying outliers.

#### Outlier Detection Resources [Optional]
Here are a couple of links to outlier detection processes and algorithms. Since this is an ETL course rather than a statistics course, you don't need to read these in order to complete the lesson.

* [scikit-learn novelty and outlier detection](http://scikit-learn.org/stable/modules/outlier_detection.html)
* [statistical and machine learning methods for outlier detection](https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561)

#### Tukey Rule
* Find the first quartile, Q1 (i.e., the .25 quantile)
* Find the third quartile, Q3 (i.e., the .75 quantile)
* Calculate the inter-quartile range (IQR = Q3 - Q1)
* Any value that is greater than Q3 + 1.5 * IQR is an outlier
* Any value that is less than Q1 - 1.5 * IQR is an outlier
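
The steps above can be sketched as a small helper. The function name `tukey_outliers` is hypothetical, and the quartiles come from Pandas' default (linear interpolation) quantile method, so other tools may give slightly different cutoffs.

```python
import pandas as pd

def tukey_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

outliers = tukey_outliers([1, 2, 3, 4, 5, 100])
print(outliers.tolist())
```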

### Outliers - What to do
Now that you've identified the outliers, what next? The answer varies depending on how the outliers affect your machine learning model. If removing the outliers improves your machine learning model, maybe it's best to delete them. But if an outlier has a special meaning, so that it has to be taken into account in your model, then it's best to keep it or find other solutions. Finally, if removing the outliers has little or no effect on your results, you can leave them as they are.

## Scaling Data
Numerical data comes in all different distribution patterns and ranges. This can be an issue for machine learning algorithms that are sensitive to the scale of the features, such as algorithms that calculate Euclidean distance between points, PCA, and linear regression with gradient descent. This issue can be solved by performing data **normalization** or **feature scaling**. Two common ways to scale features are **rescaling** and **standardization**.

### Rescaling / Normalization
With rescaling, the distribution of the data remains the same but the range changes to 0-1. You scale down the data range so the minimum value is zero and the maximum value is one.

To normalize data, you take a feature, like gdp, and use the following formula

$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$

where 
* x is a value of gdp
* x_max is the maximum gdp in the data
* x_min is the minimum gdp in the data
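
The formula above translates directly into code. This is a plain-Python sketch with made-up gdp values; note that it would divide by zero if every value in the feature were the same.

```python
def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

gdp = [2.0, 4.0, 10.0]
normalized = min_max_scale(gdp)
print(normalized)  # [0.0, 0.25, 1.0]
```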


### Standardization
With standardization, the general shape of the distribution remains the same but the mean and standard deviation are standardized. You transform the data so that it has a mean of zero and a standard deviation of one.
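
In other words, each value has the mean subtracted and is then divided by the standard deviation. A minimal sketch using the standard library follows; it uses the population standard deviation (`pstdev`), which is an assumption, since some tools use the sample standard deviation instead.

```python
import statistics

def standardize(values):
    """Shift values to mean 0 and scale them to standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(x - mean) / std for x in values]

scaled = standardize([2.0, 4.0, 6.0])
print(scaled)
```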

## Feature Engineering
Feature engineering refers to the process of creating new features, and it is a very broad topic. Data engineers wouldn't necessarily decide which features are good ones to engineer, but they might be asked to write code that transforms data into a new feature.

In the video above, we looked at taking the polynomial of data as a feature engineering example. Using the two features `x` and `y`, you can create many new features like `x^2`, `x^3`, `xy`, `y^2`, `x^2y`, and so on. Creating new features is especially useful if your model is underfitting, that is, when the existing features can't capture the trend.
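
A few of these polynomial features can be built by hand in Pandas, as in this sketch with made-up values for `x` and `y`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Each new feature is a product of powers of the original features
df["x^2"] = df["x"] ** 2
df["xy"] = df["x"] * df["y"]
df["y^2"] = df["y"] ** 2
print(df)
```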



## Load
**Load** refers to storing the data in a database. There are many options for data storage.

* Relational databases, which are queried with SQL, work well with structured data
* CSV files work well with data that fits in a Pandas DataFrame

Part of the data engineers' job is to know how to work with different data storage options.
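
As a sketch of both options, a DataFrame can be written to a SQLite table with `to_sql` or serialized as CSV with `to_csv`. The table name `gdp` and the sample values are hypothetical.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"country": ["Brazil", "India"], "gdp": [1.8, 2.6]})

# Option 1: load into a relational database (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
df.to_sql("gdp", conn, index=False, if_exists="replace")
loaded = pd.read_sql("SELECT * FROM gdp", conn)

# Option 2: serialize as CSV (pass a file path to to_csv to write a file instead)
csv_text = df.to_csv(index=False)
print(loaded)
```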

#### Links to Other Data Storage Systems
* ranking of database engines
* Redis
* Cassandra
* Hbase
* MongoDB

## Lesson Summary

In this lesson, you've learned to create ETL pipelines: you pull the data from different sources, transform the data using various techniques, and load the transformed data into a data store.

#### I. Extract data from different sources
* csv files
* json files
* APIs

#### II. Transform data
* combining data from different sources
* data cleaning
* data types
* parsing dates
* file encodings
* missing data
* duplicate data
* dummy variables
* remove outliers
* scaling features
* engineering features

#### III. Load data
* send the transformed data to a database

# NLP Pipelines

In this lesson, you'll be introduced to some of the steps involved in an NLP pipeline:

1. Text Processing
    * Cleaning
    * Normalization
    * Tokenization
    * Stop Word Removal
    * Part of Speech Tagging
    * Named Entity Recognition
    * Stemming and Lemmatization
2. Feature Extraction
    * Bag of Words
    * TF-IDF
    * Word Embeddings
3. Modeling

## How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

* **Text Processing**: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
* **Feature Extraction**: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
* **Modeling**: Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.


## Text Processing

### Stage 1: Text Processing
The first chunk of this lesson will explore the steps involved in text processing, the first stage of the NLP pipeline.

You'll prepare text data from different sources with the following text processing steps:

1. **Cleaning** to remove irrelevant items, such as HTML tags
2. **Normalizing** by converting to all lowercase and removing punctuation
3. Splitting text into words or **tokens**
4. Removing words that are too common, also known as **stop words**
5. Identifying different **parts of speech** and **named entities**
6. Converting words into their dictionary forms, using **stemming and lemmatization**

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.

#### Why Do We Need to Process Text?
* **Extracting plain text**: Textual data can come from a wide variety of sources: the web, PDFs, word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source specific markup or constructs that are not relevant to your task.
* **Reducing complexity**: Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

### Cleaning
Let's walk through an example of cleaning text data from a popular source - the web. You'll be introduced to helpful tools in working with this data, including the **requests** library, **regular expressions**, and **Beautiful Soup**.

```python
import requests

# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
print(r.text)
```
> Outputs Entire HTML Source

This seems inefficient so let's use the `BeautifulSoup` library to help.
```python
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup Library
soup = BeautifulSoup(r.text, "html5lib") # pass in raw web page and create a soup object
print(soup.get_text()) # extract plain text from all the tags
```
> Outputs No tags but there is still JavaScript and tons of white space.

Go to the web page and inspect the element you are interested in. Let's first look at the `course-summary-card` elements.
```python
# Find all course summaries
summaries = soup.find_all("div", class_="course-summary-card")
summaries[0]
```
> Outputs A list of all the `course-summary-card` values and will print out the 1st item in that list.

Looking at the summary we find that the title is nested in an `<a>` tag inside an `<h3>` tag.
```python
# Extract title
summaries[0].select_one("h3 a").get_text().strip()
```
> Output:  Intro to Programming Nanodegree

Going back we can also see the summary is in a `<div data-course-short-summary="">` tag.
```python
# Extract description
summaries[0].select_one("div[data-course-short-summary]").get_text().strip()
```
> Output: 
> *Udacity's Intro to Programming is your first step towards careers in Web and App Development, Machine Learning, Data Science, AI, and more! This program is perfect for beginners.*

```python
# Find all course summaries, extract title and description
courses = []
summaries = soup.find_all("div", class_="course-summary-card")
for summary in summaries:
  title = summary.select_one("h3 a").get_text().strip()
  description = summary.select_one("div[data-course-short-summary]").get_text().strip()
  # Store each course as a (title, description) pair
  courses.append((title, description))

print(len(courses), "course summaries found. Sample:")
print(courses[0][0])
print(courses[0][1])
```

> Output: 193 course summaries found. Sample:
Intro to Programming Nanodegree
Udacity's Intro to Programming is your first step towards careers in Web and App Development, Machine Learning, Data Science, AI, and more! This program is perfect for beginners.

#### Documentation for Python Libraries:
* [Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)
* [Regular Expressions](https://docs.python.org/3/library/re.html)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Normalization
Plain text is still human language with all its variations and bells and whistles so in normalization, we will try to reduce some of that complexity.

#### Capitalization Removal
In the English language, the starting letter of the first word in any sentence is usually capitalized. All caps are sometimes used for emphasis and for stylistic reasons. While this is convenient for a human reader, from the standpoint of a machine learning algorithm it does not make sense to differentiate between variations that mean the same thing:

* Car
* car
* CAR

Therefore, we usually convert every letter in our text to a common case, usually lowercase, so that each word is represented by a unique token.

Here's some sample text from a movie review:
> The first time you see The Second Renaissance it may look boring. Look at it at least twice and definetly watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?

If we have the review stored in a variable called text, converting it to lowercase is a simple call to the lower method in Python.
```python
# Convert to lowercase
text = text.lower()
print(text)
```
> **Output**
the first time you see the second renaissance it may look boring. look at it at least twice and definetly watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?

Note all the letters that were changed.

#### Punctuation Removal
Other languages may or may not have a case equivalent, but similar principles may apply. Depending on your NLP task, you may also want to remove special characters like periods, question marks, and exclamation points from the text and only keep letters of the alphabet and maybe numbers.

This is useful when looking at text documents as a whole, in applications like document classification and clustering, where the low-level details don't affect the application.

To do this, we can use a regular expression that matches everything that is not a lowercase a to z, an uppercase A to Z, or a digit zero to nine, and replaces each match with a space.

```python
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text) # Anything that isn't A through Z or 0 through 9 will be replaced by a space
print(text)
```
> **Output**
the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

This approach avoids having to specify all punctuation characters, but you can use other regular expressions as well.

Lowercase conversion and punctuation removal are the two most common text normalization steps. If and when you apply these steps depends on your end goal and how you design your pipeline.

> It is better to replace punctuation with a space than to remove it. Replacing it with a space makes sure that words don't get concatenated together, in case the original text did not have a space before or after the punctuation.
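
The difference described in the note can be seen on a short made-up string where a period is not followed by a space:

```python
import re

text = "it may look boring.look at it"

# Replace each punctuation character with a space: the words stay separate
spaced = re.sub(r"[^a-zA-Z0-9]", " ", text)
# Delete punctuation instead (spaces are kept): two words get glued together
removed = re.sub(r"[^a-zA-Z0-9 ]", "", text)

print(spaced)   # it may look boring look at it
print(removed)  # it may look boringlook at it
```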

### Tokenization
Token is a fancy term for a symbol that holds some meaning and is not typically split up any further.

In natural language processing, our tokens are usually individual words. This means that the process of tokenization is simply splitting a sentence into a sequence of words. The simplest way to do this is using the split method which returns a list of words.

> **Input**
the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

```python
# Split text into tokens (words)
words = text.split()
print(words)
```
> **Output**
['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definetly', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']

Notice that it splits on whitespace characters (spaces, tabs, new lines, etc.) and automatically ignores two or more whitespace characters in a sequence, so it doesn't return blank strings. This can be further adjusted using optional parameters.
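
For instance, passing an explicit separator changes this behavior. The short string below is made up to include a double space and a tab:

```python
text = "look  boring\tlook"

# Default: split on any run of whitespace, no empty strings
default_split = text.split()
# Explicit separator: split on every single space, tabs are left alone
space_split = text.split(" ")

print(default_split)  # ['look', 'boring', 'look']
print(space_split)    # ['look', '', 'boring\tlook']
```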

### Natural Language Toolkit (NLTK)
So far, we've only been using Python's built-in functionality, but some of these operations are much easier to perform using a library like Natural Language Toolkit (NLTK).

The most common approach for splitting up text in NLTK is to use the `word_tokenize` function from `nltk.tokenize`.
```python
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)
```
This performs the same task as split but has a few more features than the split method.

For example if we gave it
> Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.

it would return the following

> ['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']

You'll notice that punctuation is treated differently based on its position. For example, 'Dr.' has been tokenized as one word rather than being split into two separate tokens, 'Dr' and '.'. NLTK uses rules or patterns to decide what to do with each punctuation mark.

#### NLTK's Sentence Tokenization
There are instances where you may need to split a longer document into sentences, for example before translation. You can achieve this with NLTK using `sent_tokenize`.

```python
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences) 
```
> **Output**
['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']

Now one could tokenize based on words if needed.

NLTK provides several other tokenizers, including:

* a regular expression based tokenizer that can remove punctuation and perform tokenization in a single step
* a tweet tokenizer that is aware of Twitter handles, hashtags, and emoticons

Reference:

* nltk.tokenize package: http://www.nltk.org/api/nltk.tokenize.html
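
As a brief sketch of the regular expression based tokenizer mentioned above, NLTK's `RegexpTokenizer` can keep only runs of word characters, which drops punctuation and tokenizes in one step. The pattern `\w+` is one possible choice, not the only one.

```python
from nltk.tokenize import RegexpTokenizer

# Keep runs of letters, digits, and underscores; everything else is discarded
tokenizer = RegexpTokenizer(r"\w+")
words = tokenizer.tokenize("Dr. Smith graduated from the University of Washington.")
print(words)
```

Unlike `word_tokenize`, this pattern does not keep 'Dr.' together with its period, so the choice of tokenizer depends on your task.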

### Stop Word Removal
**Stop words** are very common words that don't add a lot of meaning to a sentence or phrase (e.g., is, the, in, at).

We want to remove them to simplify procedures down the pipeline.

For example you may have the statement:
> Dogs are the best

Even with removing "are" and "the", the positive sentiment about dogs is still conveyed.

A common package that has a pre-set list of stop words is NLTK.

```python
# List stop words from NLTK
from nltk.corpus import stopwords
print(stopwords.words("english"))
```

The NLTK stop word list can be used to filter a list of words.
```python
words = ['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definetly', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']

# Remove stop words (build the set once so each lookup is fast)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)
```


### Part-of-Speech Tagging
**Note**: Part-of-speech tagging using a predefined grammar, like the custom grammar shown later in this section, is a simple but limited solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!

There are other more advanced forms of POS tagging that can learn sentence structures and tags from given data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).

NLTK can label the part of speech of each word it is given.

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tag parts of speech (PoS)
sentence = word_tokenize("I always lie down to tell a lie.")
pos_tag(sentence)
```
Output:
```python
[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]
```
A custom grammar can be used to parse an ambiguous sentence; the parser returns each possible way the sentence could be read.
```python
import nltk
from nltk.tokenize import word_tokenize

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
  print(tree)
```
The parse trees can be visualized further by calling the `draw` method on each tree:
```python
# Visualize parse trees
for tree in parser.parse(sentence):
  tree.draw()
```
To learn more about PoS tagging in NLTK:

* NLTK documentation on `pos_tag`: [Chapter 5. Categorizing and Tagging Words](http://www.nltk.org/book/ch05.html)
* Stack Overflow thread on the tags returned by `pos_tag`: [What are all possible pos tags of NLTK?](https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)

### Named Entity Recognition
Named entities are nouns or noun phrases that refer to a specific object, person, or place.

To label these we can use the `ne_chunk` function in NLTK.
```python
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Recognize named entities in a tagged sentence
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))
```

### Stemming and Lemmatization

#### Stemming
Stemming is the process of reducing a word to its stem or root form. For example, branching, branched, and branches all reduce to the stem branch.

This is a quick and rough process, so sometimes the result isn't a complete word. For example, caching, cached, and caches all reduce to the stem "cach", which isn't a word. But as long as all words related to cache reduce to the same stem, that stem still captures their common idea.

NLTK offers several stemmers; this example uses the Porter stemmer.
```python
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)
```

#### Lemmatization
**Lemmatization** is the process of mapping a word back to its root form using a dictionary. For example, is, was, and were would all be lemmatized to "be".

NLTK's default lemmatizer uses the WordNet database.

```python
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form (create the lemmatizer once and reuse it)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)
```
The lemmatizer needs to know each word's part of speech. It defaults to nouns, but the `pos` parameter changes which part of speech it assumes.
```python
# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)
```

### Text Processing Summary
Let's go over a summary of all the text processing we just covered.

|Text Processing | Example |
|--|--|
|Given |Jenna went back to University|
|Normalized |jenna went back to university|
|Tokenized |<"jenna", "went", "back", "to", "university">|
|Stop Word Removal |<"jenna", "went", "university">|
|Stem & Lemmatized |<"jenna", "go", "univers">|

### Feature Extraction
Text is stored using encodings like ASCII or Unicode, which represent each letter as a number that is then stored or transmitted in binary (0s and 1s).

But words, rather than letters, hold meaning, and computers don't have a standard representation for words. Internally, a word is just a sequence of ASCII or Unicode values, which doesn't easily capture the meaning of a word or the relationships between words.

In comparison, an image's pixel value contains the relative intensity of light. For a color image we keep a value for each of the primary colors (red, green, and blue), which carry relevant information. This means that pixels with similar values are also visually similar. So pixel values can be used directly in a numerical model for images.

#### How can we do the same thing for language that we do for image modeling?

It depends on the model and its goal.

For example, for a graph-based model meant to extract insights, you might create a web of nodes.

But if you want a statistical model, you will need a numerical representation.

* If you are working at the document level (for spam detection or document sentiment), you would use bag-of-words or doc2vec.
* If you are working with individual words and phrases (for text generation or machine translation), you would use word2vec or GloVe.

With practice, you will learn to determine which method best fits your use case.

WordNet visualization tool: http://mateogianolio.com/wordnet-visualization/

### Bag of Words
Each document is turned into an unordered collection of words. For a plagiarism check for students in a class each submission or report could be considered a document. But if you are looking at sentiment in a tweet, each tweet would be considered a document.

The first step is text processing. The table below shows each given title and the result after text processing.

|Given | Text Processed |
|--|--|
|Little House on the Prairie | {"littl", "hous", "prairi"}|
|Mary had a Little Lamb	| {"mari", "littl", "lamb"}|
|The Silence of the Lambs | {"silenc", "lamb"}|
|Twinkle Twinkle Little Star | {"twinkl", "littl", "star"}|

The above table is a good start, but the result doesn't show that there were two "Twinkle"s in "Twinkle Twinkle Little Star". A better way to represent this is with a Document-Term Matrix.

This is usually done with a set of documents, known as a corpus (D).

#### Document-Term Matrix
| | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|--|--|--|--|--|--|--|--|--|
|Little House on the Prairie |1| 1| 1| 0| 0| 0| 0| 0|
|Mary had a Little Lamb |1| 0| 0| 1| 1| 0| 0| 0|
|The Silence of the Lambs |0| 0| 0| 0| 1| 1| 0| 0|
|Twinkle Twinkle Little Star |1| 0| 0| 0| 0| 0| 2| 1|

Each row is a vector that represents one document, and each number in the vector is called a term frequency.
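
The matrix above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes the text processing step has already produced the stemmed tokens shown earlier (kept as lists rather than sets, so that the second "twinkl" is not lost); the variable names are illustrative.

```python
from collections import Counter

# Stemmed tokens for each title, assumed to come from the text processing step
corpus = {
    "Little House on the Prairie": ["littl", "hous", "prairi"],
    "Mary had a Little Lamb": ["mari", "littl", "lamb"],
    "The Silence of the Lambs": ["silenc", "lamb"],
    "Twinkle Twinkle Little Star": ["twinkl", "twinkl", "littl", "star"],
}

# Vocabulary in order of first appearance across the corpus
vocabulary = []
for tokens in corpus.values():
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per term, each cell a term frequency
doc_term_matrix = {}
for title, tokens in corpus.items():
    counts = Counter(tokens)
    doc_term_matrix[title] = [counts[term] for term in vocabulary]

print(vocabulary)
for title, row in doc_term_matrix.items():
    print(row, title)
```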

#### Comparing Documents
Compare two documents based on how many words they have in common or how similar their term frequencies are.

| | |littl|hous|prairi|mari|lamb|silenc|twinkl|star|
|---|---|---|---|---|---|---|---|---|---|
|a|Little House on the Prairie|1|1|1|0|0|0|0|0|
|b|Mary had a Little Lamb|1|0|0|1|1|0|0|0|

This is done mathematically with a dot product, which is the sum of the products of corresponding elements. A larger dot product indicates that the two vectors are more similar.

![dot](./02_NLP_pipelines/images/01.png)

|dot product of a and b =	|`1*1`	|`1*0`	|`1*0`	|`0*1`	|`0*1`	|`0*0`	|`0*0`	|`0*0`|
|---------------------------|-------|-------|-------|-------|-------|-------|-------|-----|
|=	|1	|0	|0|	0|	0	|0	|0	|0|
|=	|1	|   | |  |      |   |   | |

The dot product only captures the overlap; it doesn't take into account the values that don't overlap. As a result, a pair of very different documents can get the same score as a pair of identical documents.

The way around this is cosine similarity, which still uses the dot product as the numerator but divides it by the product of the two vectors' magnitudes (Euclidean norms).

![docosine](./02_NLP_pipelines/images/02.png)

This essentially treats each vector as an arrow pointing in some direction and calculates the cosine of the angle theta between the arrows for a and b. The table below works through the comparison between a and b.

| `cos(theta) = dot(a, b) / (\|\|a\|\| x \|\|b\|\|) =` | `1/3`   |
|---------------------------------------------------|---------|
|`dot product of a and b =`                         |  `1`    |
|`\|\|a\|\|`	                                    |`sqrt(3)`|
|`\|\|b\|\|`	                                    |`sqrt(3)`|

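In general, cosine similarity ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). Because term frequencies are never negative, the result for two document vectors falls between 0 and 1: identical documents have a result of 1, and documents that share no terms are orthogonal and have a result of 0.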
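
The comparison between a and b can be checked with a short sketch in plain Python. The helper names `dot` and `cosine_similarity` are illustrative, and vector `c` (the row for "The Silence of the Lambs") is added here only to show a pair of documents with no terms in common.

```python
import math

def dot(u, v):
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [1, 1, 1, 0, 0, 0, 0, 0]  # Little House on the Prairie
b = [1, 0, 0, 1, 1, 0, 0, 0]  # Mary had a Little Lamb
c = [0, 0, 0, 0, 1, 1, 0, 0]  # The Silence of the Lambs

print(dot(a, b))                # 1
print(cosine_similarity(a, b))  # roughly 1/3
print(cosine_similarity(a, c))  # 0.0, since a and c share no terms
```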


### TF-IDF

#### Document Frequency
Bag of words treats every word as equally important. But intuitively, some words occur frequently throughout a corpus. For example, in a corpus of financial documents, terms like cost or price are likely to have a high term frequency in almost every document. To compensate for this, we can count the number of documents in which each word occurs.

|	                         |littl	|hous |	prairi | mari |	lamb | silenc |	twinkl	| star |
|----------------------------|------|-----|--------|------|------|--------|---------|------|
|Little House on the Prairie |	1	|1	  | 1	   | 0	  | 0	 | 0	  |0	    | 0    |
|Mary had a Little Lamb	     |  1	|0	  | 0	   | 1	  | 1	 | 0	  |0	    | 0    |
|The Silence of the Lambs	 |  0	|0	  | 0	   | 0	  | 1	 | 1	  |0	    | 0    |
|Twinkle Twinkle Little Star |	1	|0	  | 0	   | 0	  | 0	 | 0	  |2	    | 1    |
|**Document Frequency**      |  3	|1	  | 1	   | 1	  | 2	 | 1	  |1	    | 1    |

Then divide each term frequency by that term's document frequency. The result is proportional to the term frequency but inversely proportional to the number of documents the term appears in.

|	                         |littl	|hous |	prairi | mari |	lamb | silenc |	twinkl	| star |
|----------------------------|------|-----|--------|------|------|--------|---------|------|
|Little House on the Prairie |	1/3	|1	  | 1	   | 0	  | 0	 | 0	  |0	    | 0    |
|Mary had a Little Lamb	     |  1/3	|0	  | 0	   | 1	  | 1/2	 | 0	  |0	    | 0    |
|The Silence of the Lambs	 |  0	|0	  | 0	   | 0	  | 1/2	 | 1	  |0	    | 0    |
|Twinkle Twinkle Little Star |	1/3	|0	  | 0	   | 0	  | 0	 | 0	  |2	    | 1    |
|**Document Frequency**      |  3	|1	  | 1	   | 1	  | 2	 | 1	  |1	    | 1    |

Higher values (e.g., "Mary" and "Silence") mean a term is unique to a particular document, while smaller values (e.g., "Little" or "Lamb") mean the term is used frequently throughout the corpus. This allows for better characterization of each document.

#### Term Frequency - Inverse Document Frequency (TF-IDF) Transform
The TF-IDF transform combines two weights:

* Term Frequency (tf)
* Inverse Document Frequency (idf)

#### Term Frequency
Term frequency is mathematically defined as the count of a term (t) in a document (d) divided by the total number of terms in the document.

![dot](./02_NLP_pipelines/images/03.png)

#### Inverse Document Frequency
Inverse document frequency is the logarithm of the total number of documents in the corpus (D) divided by the number of documents in which the term (t) appears.

![dot](./02_NLP_pipelines/images/04.png)

#### Resultant Equation of the TF-IDF
These come together into the following mathematical formula.

![dot](./02_NLP_pipelines/images/05.png)

There are many variations that try to smooth or normalize the results or try to prevent edge cases and division by zero errors.

But ultimately this is a good way to assign weight to words and indicate their relevance in a given document.
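
The definitions above can be turned into a small sketch in plain Python. It assumes the natural logarithm (the base is not specified here), uses the stemmed titles from the bag-of-words example as the corpus, and assumes every term looked up appears in at least one document so that there is no division by zero. The function names are illustrative.

```python
import math
from collections import Counter

corpus = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]

def term_frequency(term, document):
    """Count of the term in the document divided by the document's length."""
    return Counter(document)[term] / len(document)

def inverse_document_frequency(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Weights for the terms of "Twinkle Twinkle Little Star"
for term in ["littl", "twinkl", "star"]:
    print(term, round(tf_idf(term, corpus[3], corpus), 3))
```

In this toy corpus "littl" gets the lowest weight because it appears in three of the four documents, while "twinkl" gets the highest because it is both repeated and unique to one document.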

### Word Embeddings
As the number of words in a dataset grows, one-hot encoding becomes less and less sustainable because the size of each word representation grows with the number of words.

This is where word embeddings come in: they limit each word representation to a fixed-size vector. For each word, we want to find an embedding in a vector space that exhibits some desired properties.

For example, words with similar meanings, such as kid and child, should be closer together than words with disparate meanings (e.g., rock).

Another example is words that differ in similar ways, like man, king, woman, and queen. The distance between man and woman should be similar to the distance between king and queen.
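
To make the man/woman and king/queen property concrete, here is a toy sketch with made-up 2-D vectors. Real embeddings are learned from data and have many more dimensions; the values below are assumptions chosen only so that the arithmetic is easy to follow.

```python
import math

# Made-up vectors: axis 0 loosely tracks "royalty", axis 1 loosely tracks "gender"
embeddings = {
    "man": [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
}

def distance(word_a, word_b):
    """Euclidean distance between two word vectors."""
    return math.dist(embeddings[word_a], embeddings[word_b])

print(distance("man", "woman"))   # 1.0
print(distance("king", "queen"))  # 1.0

# The offset from man to woman matches the offset from king to queen
offset_man_woman = [w - m for m, w in zip(embeddings["man"], embeddings["woman"])]
offset_king_queen = [q - k for k, q in zip(embeddings["king"], embeddings["queen"])]
print(offset_man_woman == offset_king_queen)  # True
```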

For more on word embeddings, take a look at the optional content at the end of the lesson.

#### Word Embedding - Word2Vec
Word2Vec is one of the most popular word embedding models. As the name indicates, it transforms words into vectors, so let's look at how that transformation is done.

The core idea is to predict a given word using its neighboring words, or to use a word to predict its neighboring words. A model that can do this well is likely to have a strong grasp of the contextual meaning of words.

There are 2 main cases:

1. Given a word, the model predicts the neighboring words. This is called Continuous Skip-gram.
2. Given the neighboring words, the model predicts the word they surround. This is called Continuous Bag of Words (CBoW).

##### Case 1: Skip-gram Model
In the Skip-gram model, a word is chosen from a sentence. This word is converted into a one-hot encoded vector and fed into a neural network or probabilistic model. The model is designed to predict a few surrounding words, its context. We then optimize the model's weights or parameters and repeat until the model best predicts the surrounding words.

Now, take an intermediate representation like a hidden layer in a neural network. The outputs of that layer for a given word become the corresponding word vector.

![skip-gram](./02_NLP_pipelines/images/06.PNG)
![word2vec](./02_NLP_pipelines/images/07.PNG)
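
As a rough illustration of the training data a Skip-gram model sees, the sketch below generates (target, context) pairs from a tokenized sentence. The function name, the sample sentence, and the window size are assumptions for illustration; the network that learns from these pairs is not shown.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skip_gram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair becomes one training example: the target word is the input, and the context word is what the model should predict.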

##### Case 2: Continuous Bag of Words (CBoW)
In the CBoW model, the roles are reversed: the neighboring words are the input, and the model is trained to predict the word they surround.

Word2Vec yields a very robust representation of words because the meaning of each word is distributed throughout the vector. The size of the word vector depends on how you want to trade off performance against complexity. Unlike BoW, the vector size remains constant no matter how many words you train on. Once you have trained a large set of word vectors, you can store them in a lookup table for future use.

Stored in a lookup table, the vectors are ready to be used in deep learning architectures. For example, they can be used as the input vectors for recurrent neural nets. It is also possible to use RNNs to learn even better word embeddings. Some other optimizations are possible that further reduce the model and training complexity, such as representing the output words using Hierarchical Softmax or computing the loss using Sparse Cross Entropy.
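
For comparison with Skip-gram, the sketch below builds CBoW-style training examples, where the neighboring words are the input and the word they surround is the prediction target. As before, the function name, sentence, and window size are illustrative assumptions.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples for a tokenized sentence."""
    examples = []
    for i, target in enumerate(tokens):
        # Words up to `window` positions before and after the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
examples = cbow_examples(tokens, window=1)
for context, target in examples:
    print(context, "->", target)
```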

#### Global Vectors for Word Representation (GloVe)
GloVe, or Global Vectors for Word Representation, is an embedding approach that tries to directly optimize the vector representation of each word using co-occurrence statistics.

First, consider the probability that word j appears in the context of word i. For example, what is the probability that the word "cup" appears in the context (within 1-2 neighboring words) of the word "coffee"? The words "cup" and "coffee" are often related, so we would expect a relatively high probability.

To estimate it, we count all such occurrences of i and j in our text collection and then normalize the count to get a probability. Two random vectors are then initialized for each word:

1. One for the word as a context
2. One for the word as the target

Now, for any pair of words, i and j, we want the dot product of their word vectors to equal their co-occurrence probability.

![coocurence_prob](./02_NLP_pipelines/images/08.PNG)

Using this as our goal and a suitable loss function, we can iteratively optimize these word vectors. The result should be a set of vectors that capture the similarities and differences between individual words. If you look at it from another point of view, we are essentially factorizing the co-occurrence probability matrix into two smaller matrices. This is the basic idea behind GloVe. All that sounds good, but why co-occurrence probabilities?

Consider this table of probabilities:

| |solid|water|
|---|---|---|
|ice|P(solid\|ice)|P(water\|ice)|
|steam|P(solid\|steam)|P(water\|steam)|

Intuitively, one would come across "solid" more often in the context of "ice" than in the context of "steam", while "water" could occur in either context with roughly equal probability. And that is what we see in the co-occurrence probabilities.

![coocurence_prob](./02_NLP_pipelines/images/09.PNG)

Given a large corpus, you'll find that the ratio of P(solid|ice) to P(solid|steam) is much greater than one, while the ratio of P(water|ice) to P(water|steam) is close to one.

Thus, we see that co-occurrence probabilities already exhibit some of the properties we want to capture. In fact, one refinement over using raw probability values is to optimize for the ratio of probabilities. The co-occurrence probability matrix is huge and the co-occurrence probability values are typically very low, so it makes sense to work with the log of these values.
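
The counting and normalizing steps can be sketched in plain Python on a tiny made-up corpus. Two simplifying assumptions are made here: the six sentences are invented for illustration, and the "context" of a word is taken to be every other word in the same sentence rather than a fixed window. With so little text the ratios are far less extreme than on a large corpus, but they point in the same direction.

```python
from collections import Counter, defaultdict

# Invented sentences, used only to make the counts easy to verify by hand
sentences = [
    "ice is a cold solid",
    "solid ice floats on water",
    "ice is frozen water",
    "steam is hot water",
    "steam rises from boiling water",
    "steam is a gas not solid",
]

# counts[i][j] = number of times word j appears in the same sentence as word i
counts = defaultdict(Counter)
for sentence in sentences:
    tokens = sentence.split()
    for p, word_i in enumerate(tokens):
        for q, word_j in enumerate(tokens):
            if p != q:
                counts[word_i][word_j] += 1

def probability(word_j, word_i):
    """Estimate P(j | i) by normalizing the co-occurrence counts for word i."""
    return counts[word_i][word_j] / sum(counts[word_i].values())

# Ratio of P(word | ice) to P(word | steam)
ratios = {
    word: probability(word, "ice") / probability(word, "steam")
    for word in ["solid", "water"]
}
for word, ratio in ratios.items():
    print(word, round(ratio, 2))
```

Here the ratio for "solid" comes out above 2, while the ratio for "water" stays close to 1.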

To get a better understanding of this technique, I encourage you to read the paper that introduced it, [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).

#### Embedding for Deep Learning
Word embeddings are fast becoming the de facto choice for representing words, especially for use in deep neural networks. The distributional hypothesis states that words that occur in the same contexts tend to have similar meanings. For example, consider these sentences:

> A: Would you like to have a cup of `<blank>`?
>
> B: I like my `<blank>` black.
>
> C: I need my morning `<blank>` before I can do anything.

By now you probably have a word in mind to fill in the `<blank>`. Let's look at some follow-up questions:

1. What would the blank be? "Tea" or "Coffee"
2. What words in the sentence gave you the context clue for the word?
    * "Cup"
    * "Black"
    * "Morning"

Either "Tea" or "Coffee" could fill in the blanks and make sense. In these contexts, tea and coffee are actually similar. Therefore, when a large collection of sentences is used to learn an embedding, words with common context words tend to get pulled closer and closer together. Of course, there could also be contexts in which tea and coffee are dissimilar.

For example:

> A: `<blank>` grounds are great for composting.
>
> B: I prefer loose leaf `<blank>`.

A is clearly talking about "coffee grounds", while B is talking about "loose leaf tea".

We can capture these similarities and differences in the same embedding by adding another dimension. Words can be close along one dimension and far apart along another. For example, "tea" and "coffee" are both beverages but differ in other ways. One dimension could capture all the variability among beverages.

In a human language, there are many more dimensions along which word meanings can vary. The more dimensions you can capture in your word vector, the more expressive that representation will be.

#### How many dimensions do you really need?
For example, a typical neural network architecture designed for an NLP task like word prediction could have a few hundred dimensions in its word embedding layer. This might seem large, but remember that a one-hot encoding is as large as the vocabulary, which is sometimes tens of thousands of words.

You can also learn the embedding as part of the model training process and obtain a representation that captures the dimensions most relevant to your task. This adds complexity, so we often use pre-trained embeddings (Word2Vec or GloVe) as a lookup instead, unless the use case is very narrow, like one for medical terminology. This lets you train only the layers specific to your task.
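
The size difference between a one-hot encoding and an embedding lookup can be seen in a small sketch. The five-word vocabulary, the embedding size of 3, and the random table values are all assumptions for illustration; in practice the table would hold trained or pre-trained vectors.

```python
import random

vocabulary = ["coffee", "tea", "cup", "black", "morning"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector as long as the vocabulary, with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# Hypothetical embedding table filled with random values as a stand-in
# for vectors that would normally come from training
random.seed(0)
embedding_dim = 3
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocabulary
]

def embed(word):
    """Look up the fixed-size vector for a word."""
    return embedding_table[word_to_index[word]]

print(len(one_hot("tea")))  # grows with the vocabulary: 5
print(len(embed("tea")))    # stays fixed: 3
```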

Compare this with the network architecture for a computer vision task, say, image classification. The raw input here is also very high dimensional. For example, even a 128 by 128 image contains over 16 thousand pixels. We typically use convolutional layers to exploit the spatial relationships in image data and reduce this dimensionality. Early stages of visual processing are often transferable across tasks, so it is common to use some pre-trained layers from an existing network, like AlexNet or VGG16, and only learn the later layers. Come to think of it, using an embedding lookup for NLP is not unlike using pre-trained layers for computer vision. Both are great examples of transfer learning.

#### t-SNE
Word embeddings need to have high dimensionality to capture sufficient variations in natural language, but this makes them hard to visualize.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that can map high dimensional vectors to a lower dimensional space. It's kind of like Principal Component Analysis (PCA), but it tries to maintain relative distances between objects, so that similar ones stay closer together while dissimilar objects stay further apart.

If we look at the larger vector space, we can discover meaningful groups of related words. Sometimes it takes a while to realize why certain clusters are formed, but most of the groupings are very intuitive.

t-SNE also works on other kinds of data, such as images. For example, pictures from the Caltech 101 dataset are organized into clusters that roughly correspond to class labels:

* airplanes with blue sky
* sailboats of different shapes and sizes
* human faces

This is a very useful tool for better understanding the representation that a network learns and for identifying any bugs or other issues.

### Modeling
The final stage of the NLP pipeline is **modeling**, which includes designing a statistical or machine learning model, fitting its parameters to training data, using an optimization procedure, and then using it to make predictions about unseen data.

The nice thing about working with numerical features is that it allows you to choose from all machine learning models or even a combination of them.

Once you have a working model, you can deploy it as a web app, mobile app, or integrate it with other products and services. The possibilities are endless!


### NLP Pipeline Recap
We covered the 3 major steps of an NLP pipeline and the ways to approach each of these steps for your given problem.

1. Text Processing
* Cleaning
* Normalization
* Tokenization
* Stop Word Removal
* Part of Speech Tagging
* Named Entity Recognition
* Stemming and Lemmatization

2. Feature Extraction
* Bag of Words
* TF-IDF
* Word Embeddings

3. Modeling

## Machine Learning Pipelines

### Introduction
Welcome to this lesson on ML Pipelines! Here, we'll cover:

* Advantages of Machine Learning Pipelines
* Scikit-learn Pipeline
* Scikit-learn Feature Union
* Pipelines and Grid Search
* Case Study

### Using Pipeline
Below, you'll find a simple example of a machine learning workflow where we generate features from text data using count vectorizer and tf-idf transformer, and then fit it to a random forest classifier. Before we get into using pipelines, let's first use this example to go over some scikit-learn terminology.

* **ESTIMATOR**: An estimator is any object that learns from data, whether it's a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. Since estimators learn from data, they each must have a `fit` method that takes a dataset. In the example below, the `CountVectorizer`, `TfidfTransformer`, and `RandomForestClassifier` are all estimators, and each has a `fit` method.
* **TRANSFORMER**: A transformer is a specific type of estimator that has a `fit` method to learn from training data, and then a `transform` method to apply a transformation model to new data. These transformations can include cleaning, reducing, expanding, or generating features. In the example below, `CountVectorizer` and `TfidfTransformer` are transformers.
* **PREDICTOR**: A predictor is a specific type of estimator that has a `predict` method to predict on test data based on a supervised learning algorithm, and has a `fit` method to train the model on training data. The final estimator, `RandomForestClassifier`, in the example below is a predictor.

In machine learning tasks, it's pretty common to have a very specific sequence of transformers to fit to data before applying a final estimator, such as this classifier. And normally, we'd have to initialize all the estimators, `fit` and `transform` the training data for each of the transformers, and then fit to the final estimator. Next, we'd have to call `transform` for each transformer again to the test data, and finally call `predict` on the final estimator.

#### Without pipeline:

```python
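from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vect = CountVectorizer()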
tfidf = TfidfTransformer()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)
```

Fortunately, you can actually automate all of this fitting, transforming, and predicting, by chaining these estimators together into a single estimator object. That single estimator would be **scikit-learn's Pipeline**. To create this pipeline, we just need a list of `(key, value)` pairs, where the key is a string containing what you want to name the step, and the value is the estimator object.

#### Using pipeline:

```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# train classifier
pipeline.fit(X_train, y_train)

# evaluate all steps on test set
predicted = pipeline.predict(X_test)
```

Now, by fitting our pipeline to the training data, we're accomplishing exactly what we would by fitting and transforming each of these steps to our training data one by one. Similarly, when we call `predict` on our pipeline with our test data, we're accomplishing what we would by calling `transform` on each of our transformer objects with our test data and then calling `predict` on our final estimator. Not only does this make our code much shorter and simpler, it has other great advantages, which we'll cover next.

Note that every step of this pipeline has to be a transformer, except for the last step, which can be any type of estimator. Pipeline takes on all the methods of whatever the last estimator in its sequence is. For example, here, since the final estimator of our pipeline is a classifier, the pipeline object can be used as a classifier, taking on the `fit` and `predict` methods of its last step. Alternatively, if the last estimator were a transformer, then the pipeline would be a transformer.

#### Advantages of Using Pipeline
Below are the main advantages of using scikit-learn's Pipeline.

1. Simplicity and Convenience
* **Automates repetitive steps** - Chaining all of your steps into one estimator allows you to fit and predict on all steps of your sequence automatically with one call. It handles smaller steps for you, so you can focus on implementing higher level changes swiftly and efficiently.
* **Easily understandable workflow** - Not only does this make your code more concise, it also makes your workflow much easier to understand and modify. Without Pipeline, your model can easily turn into messy spaghetti code from all the adjustments and experimentation required to improve your model.
* **Reduces mental workload** - Because Pipeline automates the intermediate actions required to execute each step, it reduces the mental burden of having to keep track of all your data transformations. Using Pipeline may require some extra work at the beginning of your modeling process, but it prevents a lot of headaches later on.

2. Optimizing Entire Workflow
* **GRID SEARCH**: A method that automates the process of testing different hyperparameters to optimize a model.
* By running grid search on your pipeline, you're able to optimize your entire workflow, including data transformation and modeling steps. This accounts for any interactions among the steps that may affect the final metrics.
* Without grid search, tuning these parameters can be painfully slow, incomplete, and messy.

3. Preventing Data Leakage
* Using Pipeline, all transformations for data preparation and feature extraction occur within each fold of the cross validation process.
* This prevents common mistakes where you’d allow your training process to be influenced by your test data - for example, if you used the entire dataset, including the held-out fold, to normalize or extract features from your data.
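
As a minimal sketch of points 2 and 3, the snippet below runs scikit-learn's `GridSearchCV` over a whole pipeline. The tiny dataset, the particular parameter values, and `cv=2` are assumptions chosen only to keep the example small and runnable. Parameters of a pipeline step are addressed as `<step name>__<parameter name>`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data (1 = positive, 0 = negative), only to make the sketch runnable
texts = [
    "great movie", "loved the plot", "wonderful acting", "really great fun",
    "terrible movie", "hated the plot", "awful acting", "really boring mess",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(random_state=0)),
])

# Search over transformer and classifier settings at the same time
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'clf__n_estimators': [10, 20],
}

# Each cross-validation fold refits the whole pipeline on its training part,
# so the vectorizer never sees the held-out part of that fold
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

After fitting, `search` behaves like the best pipeline found, so `search.predict(new_texts)` applies the transformers and the classifier in one call.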

#### Pipelines and Feature Unions
* **FEATURE UNION**: [Feature union](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) is a class in scikit-learn’s Pipeline module that allows us to perform steps in parallel and take the union of their results for the next step.
* A **pipeline** performs a list of steps in a linear sequence, while a feature union performs a list of steps in parallel and then combines their results.
* In more complex workflows, multiple feature unions are often used within pipelines, and multiple pipelines are used within feature unions.

#### Using Feature Union
Taking the earlier example, let's say you wanted to extract two different kinds of features from the same text column: tfidf values and the length of the text. Your first approach might be to create an additional column from the `text` column called `txt_length`, as in the code below. Then both `text` and `txt_length` can be part of your feature matrix. But now your pipeline would break. You can't run `CountVectorizer` on NumPy arrays of strings and integers.

```python
df['txt_length'] = df['text'].apply(len)
X = df[['text', 'txt_length']].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```

Let's say you had a custom transformer called `TextLengthExtractor`. Now, you could leave `X_train` as just the original text column, if you could figure out how to add the text length extractor to your pipeline. If only you could fit it on the original text data, rather than the output of the previous transformer. But you need both the outputs of `TfidfTransformer` and `TextLengthExtractor` to feed into the classifier as input.

```python
X = df['text'].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('txt_length', TextLengthExtractor()),
    ('clf', RandomForestClassifier()),
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```
* Feature unions are super helpful for handling these situations, where we need to run two steps in parallel on the same data and combine their results to pass into the next step.
* Like pipelines, feature unions are built using a list of `(key, value)` pairs, where the key is the string that you want to name a step, and the value is the estimator object. Also like pipelines, feature unions combine a list of estimators to become a single estimator. However, a feature union runs its estimators in parallel, rather than in a sequence as a pipeline does. In this example, the estimators run in parallel are `nlp_pipeline` and `txt_len`. Notice we use a pipeline in this feature union to make sure the count vectorizer and tfidf transformer steps are still running in sequence.

```python
from sklearn.pipeline import FeatureUnion, Pipeline

X = df['text'].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('features', FeatureUnion([

        ('nlp_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),

        ('txt_len', TextLengthExtractor())
    ])),

    ('clf', RandomForestClassifier())
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```
* Now, our pipeline doesn't break and uses both features! This would be equivalent to this code.

```python
from scipy.sparse import hstack

vect = CountVectorizer()
tfidf = TfidfTransformer()
txt_len = TextLengthExtractor()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)

X_train_len = txt_len.fit_transform(X_train)
X_train_features = hstack([X_train_tfidf, X_train_len])
clf.fit(X_train_features, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)

X_test_len = txt_len.transform(X_test)
X_test_features = hstack([X_test_tfidf, X_test_len])
y_pred = clf.predict(X_test_features)
```
* The tfidf transformer and the text length extractor are each fit to the input data, in this case the raw data, independently. The two transformations are then applied in parallel, and their outputs are combined and passed to the next estimator, in this case, the classifier.

Read more about feature unions in Scikit-learn's [user guide](http://scikit-learn.org/stable/modules/pipeline.html#feature-union).
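
The examples above use `TextLengthExtractor` without defining it. One possible way to write such a transformer is sketched below; this implementation is an assumption (a hypothetical class that returns each document's character count as a single column), and the four training sentences are made up so that the whole feature union pipeline can run end to end.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each document's character count."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        # Shape (n_documents, 1) so it can be stacked next to the tfidf features
        return np.array([[len(text)] for text in X])


# Made-up data, only to exercise the pipeline
X_train = ["great movie", "loved the plot", "terrible movie", "hated the plot"]
y_train = [1, 1, 0, 0]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('nlp_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
        ('txt_len', TextLengthExtractor()),
    ])),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

pipeline.fit(X_train, y_train)

# 7 tfidf columns (one per vocabulary term) plus 1 text length column
features = pipeline.named_steps['features'].transform(X_train)
print(features.shape)
```

Inheriting from `BaseEstimator` and `TransformerMixin` gives the class `get_params`, `set_params`, and `fit_transform` for free, which is what lets it slot into pipelines, feature unions, and grid search.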
