# Data Engineering

**Data engineers** gather data from various sources and then process, combine, and manipulate it so that it can be easily accessed and used. The whole process, from gathering data to getting it ready for use, can be automated by software programs called **data pipelines**.

As a data scientist, it's important to have some level of skill in data engineering regardless of the size of the company you work in.

In larger companies, there are dedicated data engineers, but data scientists still need to communicate with them about the data a project requires. In smaller companies, data scientists may need to take on some of a data engineer's responsibilities.

## Data Pipelines: ETL vs ELT
A data pipeline is a generic term for a process that moves data from one place to another, for example from one server to another.

### ETL
An ETL pipeline is a specific, and very common, kind of data pipeline. ETL stands for Extract, Transform, Load. Imagine that you have a database containing web log data. Each entry contains the IP address of a user, a timestamp, and the link that the user clicked.

What if your company wanted to run an analysis of links clicked by city and by day? You would need another data set that maps IP address to a city, and you would also need to extract the day from the timestamp. With an ETL pipeline, you could run code once per day that would extract the previous day's log data, map IP address to city, aggregate link clicks by city, and then load these results into a new database. That way, a data analyst or scientist would have access to a table of log data by city and day. That is more convenient than always having to run the same complex data transformations on the raw web log data.
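
The daily job described above could be sketched roughly as follows. This is only an illustration with Pandas: the log rows, the IP-to-city lookup table, and all column names are made up, and a real pipeline would read from and write to actual databases.

```python
import pandas as pd

# Extract: hypothetical raw web log, one row per click
logs = pd.DataFrame({
    "ip": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 17:30",
        "2024-01-01 12:15", "2024-01-02 08:45",
    ]),
    "link": ["/home", "/home", "/home", "/home"],
})

# Hypothetical lookup table that maps an IP address to a city
ip_to_city = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2"],
    "city": ["Lima", "Oslo"],
})

# Transform: attach the city, derive the day, then count clicks per city, day, and link
enriched = logs.merge(ip_to_city, on="ip", how="left")
enriched["day"] = enriched["timestamp"].dt.date
clicks = (
    enriched.groupby(["city", "day", "link"])
    .size()
    .reset_index(name="clicks")
)

# Load: a real pipeline would write `clicks` to a database table at this point
print(clicks)
```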

Before cloud computing, businesses stored their data on large, expensive, private servers. Running queries on large data sets, like raw web log data, could be expensive both economically and in terms of time. But data analysts might need to query a database multiple times even in the same day; hence, pre-aggregating the data with an ETL pipeline makes sense.

### ELT
ELT (Extract, Load, Transform) pipelines have gained traction since the advent of cloud computing. Cloud computing has lowered the cost of storing data and running queries on large, raw data sets. Many of these cloud services, like Amazon Redshift, Google BigQuery, or IBM Db2 can be queried using SQL or a SQL-like language. With these tools, the data gets extracted, then loaded directly, and finally transformed at the end of the pipeline.
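
To make the ordering concrete, here is a minimal ELT sketch. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse, and the table and column names are hypothetical. The raw rows are loaded untouched, and the transformation happens later as a SQL query.

```python
import sqlite3

# An in-memory SQLite database stands in for a cloud data warehouse (assumption)
conn = sqlite3.connect(":memory:")

# Extract + Load: store the raw log rows as-is, with no transformation
conn.execute("CREATE TABLE raw_logs (ip TEXT, ts TEXT, link TEXT)")
conn.executemany(
    "INSERT INTO raw_logs VALUES (?, ?, ?)",
    [
        ("1.1.1.1", "2024-01-01 09:00:00", "/home"),
        ("1.1.1.1", "2024-01-01 17:30:00", "/home"),
        ("2.2.2.2", "2024-01-02 08:45:00", "/pricing"),
    ],
)

# Transform: aggregate inside the database, at query time
rows = conn.execute(
    """
    SELECT date(ts) AS day, link, COUNT(*) AS clicks
    FROM raw_logs
    GROUP BY day, link
    ORDER BY day, link
    """
).fetchall()
conn.close()

print(rows)
```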

However, ETL pipelines are still used even with these cloud tools. Oftentimes, it still makes sense to run ETL pipelines and store data in a more readable or intuitive format. This can help data analysts and scientists work more efficiently as well as help an organization become more data-driven.

## World Bank Data
This lesson uses data from the World Bank. The data comes from two sources:

1. [World Bank Indicator Data](https://data.worldbank.org/indicator) - This data contains socio-economic indicators for countries around the world. A few example indicators include population, arable land, and central government debt.
2. [World Bank Project Data](https://datacatalog.worldbank.org/dataset/world-bank-projects-operations) - This data set contains information about World Bank project lending since 1947.

Both of these data sets are available in different formats, including CSV, JSON, and XML. You can download the CSV files directly, or you can use the World Bank APIs to extract data from the World Bank's servers. You'll be doing both in this lesson.

## Summary of the data file types you'll work with
### CSV files
CSV stands for comma-separated values. These types of files separate values with a comma, and each entry is on a separate line. Oftentimes, the first entry will contain variable names. Here is an example of what CSV data looks like. This is an abbreviated version of the first three lines in the World Bank projects data csv file.
```
id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Project Financing
```
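
As a quick illustration, the sample above can be read with Pandas. The text is held in a string here so the sketch is self-contained; with a real file you would pass a file path to `read_csv` instead.

```python
import io
import pandas as pd

# The abbreviated sample from above, held in a string for illustration
csv_text = """id,regionname,countryname,prodline,lendinginstr
P162228,Other,World;World,RE,Investment Project Financing
P163962,Africa,Democratic Republic of the Congo;Democratic Republic of the Congo,PE,Investment Project Financing
"""

# By default, read_csv treats the first line as the column names
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
print(df.shape)
```
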
### JSON
JSON is a file format built around key/value pairs, and it looks a lot like a Python dictionary. The same project data, with one more record included, could look like this in JSON:
```json
[{"id":"P162228","regionname":"Other","countryname":"World;World","prodline":"RE","lendinginstr":"Investment Project Financing"},{"id":"P163962","regionname":"Africa","countryname":"Democratic Republic of the Congo;Democratic Republic of the Congo","prodline":"PE","lendinginstr":"Investment Project Financing"},{"id":"P167672","regionname":"South Asia","countryname":"People's Republic of Bangladesh;People's Republic of Bangladesh","prodline":"PE","lendinginstr":"Investment Project Financing"}]
```
Each record in the data is enclosed in curly braces {}. The variable names are the keys, and the variable values are the values.

There are other ways to organize JSON data, but the general rule is that JSON is organized into key/value pairs. For example, here is a different way to represent the same data using JSON:
```json
{"id":{"0":"P162228","1":"P163962","2":"P167672"},"regionname":{"0":"Other","1":"Africa","2":"South Asia"},"countryname":{"0":"World;World","1":"Democratic Republic of the Congo;Democratic Republic of the Congo","2":"People's Republic of Bangladesh;People's Republic of Bangladesh"},"prodline":{"0":"RE","1":"PE","2":"PE"},"lendinginstr":{"0":"Investment Project Financing","1":"Investment Project Financing","2":"Investment Project Financing"}}
```
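
Both layouts can be loaded into the same table. The sketch below uses shortened versions of the two JSON strings (two records, two fields) and relies on the standard `json` module plus the `pandas.DataFrame` constructor; it is one possible approach, not the only one.

```python
import json
import pandas as pd

# Record-oriented layout: a list of objects, one object per row
records_json = '[{"id": "P162228", "regionname": "Other"}, {"id": "P163962", "regionname": "Africa"}]'

# Column-oriented layout: one object per column, keyed by row label
columns_json = '{"id": {"0": "P162228", "1": "P163962"}, "regionname": {"0": "Other", "1": "Africa"}}'

df_records = pd.DataFrame(json.loads(records_json))
df_columns = pd.DataFrame(json.loads(columns_json))

# Same values either way; only the row labels differ (0, 1 versus "0", "1")
print(df_records)
print(df_columns)
```
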
### XML
Another data format is called XML (Extensible Markup Language). XML is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set. Here is what this same data would look like as XML.

```xml
<ENTRY>
  <ID>P162228</ID>
  <REGIONNAME>Other</REGIONNAME>
  <COUNTRYNAME>World;World</COUNTRYNAME>
  <PRODLINE>RE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P163962</ID>
  <REGIONNAME>Africa</REGIONNAME>
  <COUNTRYNAME>Democratic Republic of the Congo;Democratic Republic of the Congo</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
<ENTRY>
  <ID>P167672</ID>
  <REGIONNAME>South Asia</REGIONNAME>
  <COUNTRYNAME>People's Republic of Bangladesh;People's Republic of Bangladesh</COUNTRYNAME>
  <PRODLINE>PE</PRODLINE>
  <LENDINGINSTR>Investment Project Financing</LENDINGINSTR>
</ENTRY>
```

XML is falling out of favor, especially because JSON tends to be easier to navigate; however, you still might come across XML data. The World Bank API, for example, can return either XML data or JSON data. From a data perspective, the process for handling HTML and XML data is essentially the same.
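
As a rough sketch, XML like this can be parsed with Python's built-in `xml.etree.ElementTree` module. The sample is shortened to two fields, and the `<ENTRIES>` wrapper is an assumption added here because a well-formed XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# <ENTRIES> is a hypothetical root element that wraps the individual records
xml_text = """
<ENTRIES>
  <ENTRY>
    <ID>P162228</ID>
    <REGIONNAME>Other</REGIONNAME>
  </ENTRY>
  <ENTRY>
    <ID>P163962</ID>
    <REGIONNAME>Africa</REGIONNAME>
  </ENTRY>
</ENTRIES>
"""

root = ET.fromstring(xml_text.strip())

# Turn each <ENTRY> into a dictionary of tag name -> text
records = [
    {child.tag.lower(): child.text for child in entry}
    for entry in root.findall("ENTRY")
]
print(records)
```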

### SQL databases
SQL databases store data in tables using [primary and foreign keys](https://docs.microsoft.com/en-us/sql/relational-databases/tables/primary-and-foreign-key-constraints?view=sql-server-2017). In a SQL database, the same data would look like this:

| id | regionname | countryname | prodline | lendinginstr |
|---|---|---|---|---|
| P162228 | Other | World;World | RE | Investment Project Financing |
| P163962 | Africa | Democratic Republic of the Congo;Democratic Republic of the Congo | PE | Investment Project Financing |
| P167672 | South Asia | People's Republic of Bangladesh;People's Republic of Bangladesh | PE | Investment Project Financing |

### Text Files
This course won't go into much detail about text data. There are other Udacity courses, namely on natural language processing, that go into the details of processing text for machine learning.

Text data present their own issues. Whereas CSV, JSON, XML, and SQL data are organized with a clear structure, text is more ambiguous. For example, in the World Bank project data, country names are written like this:
```
Democratic Republic of the Congo;Democratic Republic of the Congo
```
In the World Bank Indicator data sets, the Democratic Republic of the Congo is represented by the abbreviation "Congo, Dem. Rep." You'll have to clean these country names to join the data sets together.

### Extracting Data from the Web
In this lesson, you'll see how to extract data from the web using APIs (Application Programming Interfaces). APIs generally provide data in either JSON or XML format.

Companies and organizations provide APIs so that programmers can access data in an official, safe way. APIs allow you to download, and sometimes even upload or modify, data from a web server without giving you direct access.




## Transform

Transforming data means getting data ready for a machine learning algorithm or other data science projects. Transforming data can involve a wide range of processes, such as combining data from different sources, cleaning the data, and engineering new features. In this way, you transform the original datasets to create a new dataset that is ready for use.

### Combining Data

Oftentimes you need to combine data from different datasets. The process of combining data can be complicated and very different from case to case.

Datasets can be from different sources and have different content so you need to check the data closely before combining them. Or they may be in different formats meaning you will need to transform data from one format to another first.

In the next part of the lesson, you will have an exercise on combining data from different sources using the Python Pandas package. If you are unfamiliar with Pandas, you can review the learning resources below to prepare yourself before heading to the exercise.
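
As a small preview, two common ways of combining data in Pandas are stacking rows with `concat` and joining on shared keys with `merge`. The datasets, column names, and values below are made up for illustration.

```python
import pandas as pd

# Made-up values for two hypothetical yearly population extracts with identical columns
pop_2016 = pd.DataFrame({"country": ["A", "B"], "year": [2016, 2016], "population": [100, 200]})
pop_2017 = pd.DataFrame({"country": ["A", "B"], "year": [2017, 2017], "population": [110, 210]})

# Made-up GDP values stored in a separate dataset
gdp = pd.DataFrame({"country": ["A", "B", "A"], "year": [2016, 2016, 2017], "gdp": [1.0, 2.0, 1.5]})

# Stack datasets that share the same columns
population = pd.concat([pop_2016, pop_2017], ignore_index=True)

# Join datasets on their shared key columns; how="left" keeps every population row
combined = population.merge(gdp, on=["country", "year"], how="left")
print(combined)
```

Rows without a match in the second dataset (country B in 2017 here) end up with a missing value, which is one reason to check the result of a merge before using it.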

### Cleaning Data

Dirty data refers to data that contains errors. Data errors can come from many sources, including:

* data entry mistakes
* duplicate data
* incomplete records
* inconsistencies between datasets

It's very important to audit data after obtaining it. Otherwise, your data science project may perform poorly or, even worse, give you wrong results.

In the next section, you will have an exercise on data cleaning. Your job is to clean the data so the country names across different datasets are consistent.
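
One possible approach to this kind of cleaning, sketched with Pandas: keep the part of the project country name before the semicolon, then map it to the spelling used in the indicator data. The `name_map` dictionary is a hypothetical, hand-built lookup with a single entry.

```python
import pandas as pd

projects = pd.DataFrame({"countryname": [
    "Democratic Republic of the Congo;Democratic Republic of the Congo",
    "World;World",
]})

# Keep only the text before the semicolon
projects["country"] = projects["countryname"].str.split(";").str[0]

# Hypothetical mapping from project-data names to indicator-data names
name_map = {"Democratic Republic of the Congo": "Congo, Dem. Rep."}
projects["country"] = projects["country"].replace(name_map)

print(projects["country"].tolist())
```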

## Missing Data
In the video, I say that a machine learning algorithm won't work with missing values. This is essentially correct; however, there are a couple of situations where this isn't quite true. For example, if you had a categorical variable, you could keep the NULL value as one of the options.

For example, if theme_2 could have a value of agriculture, banking, or NULL, you might encode this variable as 0, 1, 2, where the value 2 stands in for the NULL value. You could do something similar for one-hot encoding, where the theme_2 variable becomes 3 true/false features: theme_2_agriculture, theme_2_banking, theme_2_NULL. You would have to check that this improves your model performance.
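
With Pandas, one way to get this kind of encoding is `get_dummies` with `dummy_na=True`. Note that Pandas names the extra column `theme_2_nan` rather than `theme_2_NULL`; the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"theme_2": ["agriculture", "banking", None, "agriculture"]})

# dummy_na=True adds one more column that flags the missing values
encoded = pd.get_dummies(df["theme_2"], prefix="theme_2", dummy_na=True)
print(encoded.columns.tolist())
```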

There are also implementations of some machine learning algorithms, such as [gradient boosting](https://xgboost.readthedocs.io/en/latest/) decision trees that can [handle missing values](https://github.com/dmlc/xgboost/issues/21).

### Missing Data - Delete
There are two ways to handle missing values:

* Delete data
* Fill missing values (also called imputation)

You can delete a feature with a large percentage of missing values because it's unlikely to contribute to your machine learning model unless somehow you can find the missing values from another source. If you think deleting a row consisting of a lot of missing values won't affect your result, you can also delete it.
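
A minimal sketch of both kinds of deletion with Pandas. The data is made up, and the 50% cutoff for dropping a column is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column
missing_share = df.isnull().mean()

# Drop columns where more than half of the values are missing (arbitrary cutoff)
df_cols = df.drop(columns=missing_share[missing_share > 0.5].index)

# Drop any remaining rows that still contain a missing value
df_clean = df_cols.dropna()
print(df_clean)
```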

However, deleting missing values is not the only option. In the next part, we will talk about another option: filling in missing values.

### Missing Data - Impute
The process of filling in missing values is called `imputation`. Some of the imputation techniques are mean substitution, forward fill, and backward fill.

##### Mean Substitution
The mean substitution method fills in the missing values using the `column mean`. You can even group the data first and then fill in the missing values using the mean of each group. Alternatively, you can fill in the missing values with the `median` or `mode`.
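
Both variants can be sketched with Pandas `fillna`; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "gdp": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Fill with the overall column mean
df["gdp_mean"] = df["gdp"].fillna(df["gdp"].mean())

# Fill with the mean of each country's own values
group_means = df.groupby("country")["gdp"].transform("mean")
df["gdp_group_mean"] = df["gdp"].fillna(group_means)

print(df)
```

Swapping `"mean"` for `"median"` gives the median-based variant.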

##### Forward Fill or Backward Fill
Forward fill or backward fill works for `ordered` or `time series` data. In both methods, you use neighboring cells to fill in the missing value.

To use these methods, you should always make sure your data is `sorted` by a timestamp or in a meaningful way. With the forward fill, values are pushed forward `down` to replace any missing values. With backward fill, values move `up` to fill missing values.
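
A short sketch of both fills with Pandas, using made-up yearly values. The data is sorted by year first so that "forward" and "backward" have a meaningful direction.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2003, 2001, 2002, 2004],
    "gdp": [np.nan, 1.0, np.nan, 4.0],
})

# Sort first so that neighboring rows are neighboring years
df = df.sort_values("year").reset_index(drop=True)

df["gdp_ffill"] = df["gdp"].ffill()  # last known value is carried down
df["gdp_bfill"] = df["gdp"].bfill()  # next known value is carried up
print(df)
```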

## Duplicate Data
Data duplication is obvious when the same row shows up more than once, and you can simply remove the duplicates using the Pandas `drop_duplicates` method. However, duplicate data can sometimes be trickier to find, requiring you to comb through the data to recognize and eliminate the duplicates.
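
The sketch below shows the easy case and one example of a trickier case, where two entries differ only in whitespace and capitalization. The data and the `country_clean` helper column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["P1", "P1", "P2", "P3"],
    "countryname": ["Kenya", "Kenya", "Peru", " peru"],
})

# Easy case: rows that are identical in every column
exact = df.drop_duplicates()

# Trickier case: normalize the text first, then compare on the cleaned column
df["country_clean"] = df["countryname"].str.strip().str.lower()
by_country = df.drop_duplicates(subset="country_clean")

print(len(exact), len(by_country))
```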

## Dummy Variables

### When to Remove a Feature
As mentioned in the video, if you have five categories, you only really need four features. For example, if the categories are "agriculture", "banking", "retail", "roads", and "government", then you only need four of those five categories for dummy variables. This topic is somewhat outside the scope of a data engineer.

In some cases, you don't necessarily need to remove one of the features. It will depend on your application. In regression models, which use linear combinations of features, removing a dummy variable is important. For a decision tree, removing one of the variables is not needed.
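
In Pandas, `get_dummies` creates one column per category by default, and `drop_first=True` drops one of them. A small sketch using the five categories mentioned above:

```python
import pandas as pd

sectors = pd.Series(
    ["agriculture", "banking", "retail", "roads", "government"], name="sector"
)

all_five = pd.get_dummies(sectors)               # one column per category
four = pd.get_dummies(sectors, drop_first=True)  # first category (alphabetically) is dropped

print(all_five.shape[1], four.shape[1])
```

A row with all four remaining dummies set to false then stands for the dropped category.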


## Outliers

### Outliers - How to Find Them
**Outliers** are data points that are far away from the rest of the data. Outliers can be caused by errors such as data entry errors or processing mistakes. However, outliers can also be legitimate data.

There are some methods to detect outliers:

**Data visualization**
When working with one or two-dimensional data, you can visualize data to detect outliers. The data points far away from the rest of the data on the plot can potentially be outliers.

**Statistical methods**
Statistical properties of data, like means, standard deviations, and quantiles, can be used to identify outliers. Examples include the z-score from the normal distribution and the Tukey method.

**Machine learning methods**
When it comes to high-dimensional data, you can use machine learning techniques such as PCA to reduce the data dimensions. Then you can use the methods discussed before to detect outliers.

Another way is to cluster the data then calculate the distance from each data point to the cluster centroid. A large distance may indicate an outlier.

Next, you will get some practice on identifying outliers.

#### Outlier Detection Resources [Optional]
Here are a couple of links to outlier detection processes and algorithms. Since this is an ETL course rather than a statistics course, you don't need to read these in order to complete the lesson.

* [scikit-learn novelty and outlier detection](http://scikit-learn.org/stable/modules/outlier_detection.html)
* [statistical and machine learning methods for outlier detection](https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561)

#### Tukey Rule
* Find the first quartile, Q1 (i.e., the .25 quantile)
* Find the third quartile, Q3 (i.e., the .75 quantile)
* Calculate the inter-quartile range (IQR = Q3 - Q1)
* Any value that is greater than Q3 + 1.5 * IQR is an outlier
* Any value that is less than Q1 - 1.5 * IQR is an outlier
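
The rule translates almost line for line into Pandas. The values below are made up, with one obviously extreme point.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # inter-quartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```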

### Outliers - What to do
Once you've identified the outliers, what next? The answer depends on how the outliers affect your machine learning model. If removing the outliers improves your model, it may be best to delete them. But if an outlier has a special meaning and has to be taken into account in your model, then it's best to keep it or find another solution. Finally, if removing the outliers has little or no effect on your results, you can leave them as is.

## Scaling Data
Numerical data comes in many different distributions and ranges. This can be an issue for machine learning algorithms that are sensitive to the scale of the features, such as algorithms that calculate Euclidean distances between points, PCA, and linear regression trained with gradient descent. This issue can be addressed by performing data **normalization** or **feature scaling**. Two common ways to scale features are **rescaling** and **standardization**.

### Rescaling / Normalization
With rescaling, the distribution of the data remains the same but the range changes to 0-1. You scale down the data range so the minimum value is zero and the maximum value is one.

To normalize data, you take a feature, like gdp, and use the following formula

$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$

where
* x is a value of gdp
* x_max is the maximum gdp in the data
* x_min is the minimum gdp in the data
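
Applied to a Pandas column, the formula is a one-liner. The gdp values here are made up.

```python
import pandas as pd

gdp = pd.Series([2.0, 4.0, 6.0, 10.0])

# (x - x_min) / (x_max - x_min), applied to every value at once
gdp_normalized = (gdp - gdp.min()) / (gdp.max() - gdp.min())
print(gdp_normalized.tolist())
```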


### Standardization
With standardization, the general shape of the distribution remains the same but the mean and standard deviation are standardized. You transform the data so that it has a mean of zero and a standard deviation of one.
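
In formula form, this is $x_{standardized} = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature. A minimal Pandas sketch with made-up values follows; it uses the population standard deviation (`ddof=0`), which is one of two common conventions.

```python
import pandas as pd

gdp = pd.Series([2.0, 4.0, 6.0, 8.0])

# Subtract the mean, then divide by the (population) standard deviation
gdp_standardized = (gdp - gdp.mean()) / gdp.std(ddof=0)

print(gdp_standardized.mean(), gdp_standardized.std(ddof=0))
```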

## Feature Engineering
Feature engineering refers to the process of creating new features, and it is a very broad topic. Data engineers wouldn't necessarily decide which feature is a good one to engineer, but they might be asked to write code that transforms data into a new feature.

In the video above, we looked at taking the polynomial of data as a feature engineering example. Using the two features `x` and `y`, you can create many new features like `x^2`, `x^3`, `xy`, `y^2`, `x^2y`, and so on. Creating new features is especially useful if your model is underfitting, that is, when the existing features can't capture the trend.
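
A few of these polynomial features, written out by hand in Pandas with made-up values for `x` and `y`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 3, 4]})

# Each new column is a product of powers of the original features
df["x^2"] = df["x"] ** 2
df["y^2"] = df["y"] ** 2
df["xy"] = df["x"] * df["y"]
df["x^2y"] = df["x"] ** 2 * df["y"]

print(df)
```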



## Load
**Load** refers to storing the data, typically in a database. There are many options for data storage.

* Relational (SQL) databases work well with structured data
* CSV files work well with data that fits in a Pandas DataFrame

Part of a data engineer's job is to know how to work with different data storage options.
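
As a small sketch of the load step, the snippet below writes a DataFrame to a SQL table and to CSV text. It assumes an in-memory SQLite database, and the table name `indicators` and the values are made up.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"country": ["A", "B"], "gdp": [1.0, 2.0]})

# Load into a relational database (in-memory SQLite, for illustration only)
conn = sqlite3.connect(":memory:")
df.to_sql("indicators", conn, index=False, if_exists="replace")

# Read it back to confirm the rows arrived
loaded = pd.read_sql("SELECT * FROM indicators", conn)
conn.close()

# Alternatively, serialize to CSV (pass a file path to write to disk instead)
csv_text = df.to_csv(index=False)
print(loaded)
```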

#### Links to Other Data Storage Systems
* ranking of database engines
* Redis
* Cassandra
* Hbase
* MongoDB

## Lesson Summary

In this lesson, you've learned to create ETL pipelines: you pull the data from different sources, transform the data using various techniques, and load the transformed data into a data store.

#### I. Extract data from different sources
* csv files
* json files
* APIs

#### II. Transform data
* combining data from different sources
* data cleaning
* data types
* parsing dates
* file encodings
* missing data
* duplicate data
* dummy variables
* remove outliers
* scaling features
* engineering features

#### III. Load data
* send the transformed data to a database

# NLP Pipelines

In this lesson, you'll be introduced to some of the steps involved in an NLP pipeline:

1. Text Processing
    * Cleaning
    * Normalization
    * Tokenization
    * Stop Word Removal
    * Part of Speech Tagging
    * Named Entity Recognition
    * Stemming and Lemmatization
2. Feature Extraction
    * Bag of Words
    * TF-IDF
    * Word Embeddings
3. Modeling

## How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

* **Text Processing**: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
* **Feature Extraction**: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
* **Modeling**: Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.


## Text Processing

### Stage 1: Text Processing
The first chunk of this lesson will explore the steps involved in text processing, the first stage of the NLP pipeline.

You'll prepare text data from different sources with the following text processing steps:

1. **Cleaning** to remove irrelevant items, such as HTML tags
2. **Normalizing** by converting to all lowercase and removing punctuation
3. Splitting text into words or **tokens**
4. Removing words that are too common, also known as **stop words**
5. Identifying different **parts of speech** and **named entities**
6. Converting words into their dictionary forms, using **stemming and lemmatization**

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.

#### Why Do We Need to Process Text?
* **Extracting plain text**: Textual data can come from a wide variety of sources: the web, PDFs, word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source specific markup or constructs that are not relevant to your task.
* **Reducing complexity**: Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

### Cleaning
Let's walk through an example of cleaning text data from a popular source - the web. You'll be introduced to helpful tools in working with this data, including the **requests** library, **regular expressions**, and **Beautiful Soup**.

```python
import requests

# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
print(r.text)
```
> Outputs Entire HTML Source

The raw HTML is hard to read, so let's use the `BeautifulSoup` library to help.
```python
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup Library
soup = BeautifulSoup(r.text, "html5lib") # pass in raw web page and create a soup object
print(soup.get_text()) # extract plain text from all the tags
```
> Outputs No tags but there is still JavaScript and tons of white space.

Go to the web page and inspect the element you are interested in. Let's first look at the `course-summary-card`.
```python
# Find all course summaries
summaries = soup.find_all("div", class_="course-summary-card")
summaries[0]
```
> Finds all the `course-summary-card` elements and outputs the first item in that list.

Looking at the summary we find that the title is nested in an `<a>` tag inside an `<h3>` tag.
```python
# Extract title
summaries[0].select_one("h3 a").get_text().strip()
```
> Output:  Intro to Programming Nanodegree

Going back we can also see the summary is in a `<div data-course-short-summary="">` tag.
```python
# Extract description
summaries[0].select_one("div[data-course-short-summary]").get_text().strip()
```
> Output: 
> *Udacity's Intro to Programming is your first step towards careers in Web and App Development, Machine Learning, Data Science, AI, and more! This program is perfect for beginners.*

```python
# Find all course summaries, then extract each title and description
courses = []
summaries = soup.find_all("div", class_="course-summary-card")
for summary in summaries:
    title = summary.select_one("h3 a").get_text().strip()
    description = summary.select_one("div[data-course-short-summary]").get_text().strip()
    courses.append((title, description))  # store each course as a (title, description) pair

print(len(courses), "course summaries found. Sample:")
print(courses[0][0])
print(courses[0][1])
```

> Output: 193 course summaries found. Sample:
Intro to Programming Nanodegree
Udacity's Intro to Programming is your first step towards careers in Web and App Development, Machine Learning, Data Science, AI, and more! This program is perfect for beginners.

#### Documentation for Python Libraries:
* [Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)
* [Regular Expressions](https://docs.python.org/3/library/re.html)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Normalization
Plain text is still human language with all its variations and bells and whistles, so in normalization we try to reduce some of that complexity.

#### Capitalization Removal
In the English language, the starting letter of the first word in any sentence is usually capitalized. All caps are sometimes used for emphasis and for stylistic reasons. While this is convenient for a human reader, from the standpoint of a machine learning algorithm it does not make sense to differentiate between variations that mean the same thing:

* Car
* car
* CAR

Therefore, we usually convert every letter in our text to a common case, usually lowercase, so that each word is represented by a unique token.

Here's some sample text from a movie review:
> The first time you see The Second Renaissance it may look boring. Look at it at least twice and definetly watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?

If we have the review stored in a variable called text, converting it to lowercase is a simple call to the lower method in Python.
```python
# Convert to lowercase
text = text.lower()
print(text)
```
> **Output**
the first time you see the second renaissance it may look boring. look at it at least twice and definetly watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?

Note all the letters that were changed.

#### Punctuation Removal
Other languages may or may not have a case equivalent, but similar principles may apply. Depending on your NLP task, you may want to remove special characters like periods, question marks, and exclamation points from the text and only keep letters of the alphabet and maybe numbers.

This is useful when looking at text documents as a whole, in applications like document classification and clustering, where the low-level details don't affect the application.

To do this, we can use a regular expression that matches everything that is not a lowercase a to z, an uppercase A to Z, or a digit zero to nine, and replaces each match with a space.

```python
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text) # Anything that isn't A through Z or 0 through 9 will be replaced by a space
print(text)
```
> **Output**
the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

This approach avoids having to specify all punctuation characters, but you can use other regular expressions as well.

Lowercase conversion and punctuation removal are the two most common text normalization steps. If and when you apply these steps depends on your end goal and how you design your pipeline.

> It is better to replace punctuation with a space than to remove it. Replacing with a space makes sure that words don't get concatenated together, in case the original text did not have a space before or after the punctuation.
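
The difference is easy to see on a made-up sentence that is missing spaces after its punctuation. The second pattern, which keeps whitespace while deleting punctuation, is an assumption introduced here purely for the comparison.

```python
import re

text = "Great movie,loved it!Will watch again."

# Deleting punctuation outright glues neighboring words together
removed = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# Replacing punctuation with a space keeps the words apart
spaced = re.sub(r"[^a-zA-Z0-9]", " ", text)

print(removed.split())
print(spaced.split())
```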

### Tokenization
Token is a fancy term for a symbol that holds some meaning and is not typically split up any further.

In natural language processing, our tokens are usually individual words. This means that the process of tokenization is simply splitting a sentence into a sequence of words. The simplest way to do this is using the split method which returns a list of words.

> **Input**
the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

```python
# Split text into tokens (words)
words = text.split()
print(words)
```
> **Output**
['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definetly', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']

Notice that it splits on whitespace characters (spaces, tabs, new lines, etc.) and automatically treats two or more whitespace characters in a sequence as a single separator, so it doesn't return blank strings. This can be further adjusted using optional parameters.

### Natural Language Toolkit (NLTK)
So far, we've only been using Python's built-in functionality, but some of these operations are much easier to perform using a library like Natural Language Toolkit (NLTK).

The most common approach for splitting up text in NLTK is to use the `word_tokenize` function from `nltk.tokenize`.
```python
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)
```
This performs the same task as split but has a few more features than the split method.

For example if we gave it
> Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.

it would return the following

> ['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']

You'll notice that punctuation marks are treated differently based on their position. For example, 'Dr.' has been tokenized as one word rather than being split into two separate tokens, 'Dr' and '.'. NLTK uses rules or patterns to decide what to do with each punctuation mark.

#### NLTK's Sentence Tokenization
There are instances where you may need to split a longer document into sentences, for example when preparing text for translation. You can achieve this with NLTK using `sent_tokenize`.

```python
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences) 
```
> **Output**
['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']

Now one could tokenize based on words if needed.

NLTK provides several other tokenizers. Here are some of them:

* regular expression based tokenizer that can remove punctuation and perform tokenization in a single step
* tweet tokenizer that is aware of twitter handles, hash tags, and emoticons
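
A brief sketch of these two tokenizers, assuming NLTK is installed. `RegexpTokenizer` and `TweetTokenizer` are both part of `nltk.tokenize`; the sample tweet and the `\w+` pattern are made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

# Keep only runs of word characters, which drops punctuation while tokenizing
regexp_tokenizer = RegexpTokenizer(r"\w+")
words = regexp_tokenizer.tokenize("Dr. Smith graduated from the University of Washington.")
print(words)

# Keep handles, hashtags, and emoticons together as single tokens
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize("@udacity rocks #NLP :)")
print(tweet_tokens)
```
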

Reference:

* nltk.tokenize package: http://www.nltk.org/api/nltk.tokenize.html