# Daily Stopwatch Data Science: ETL with petl

Note: this is part of Insight Data Science.

## Stage 1: What is ETL?

ETL stands for Extract, Transform, and Load. It usually refers to building data pipelines that move data from primary sources into a data warehouse. In data science, it often amounts to data cleaning.
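To make the three steps concrete, here is a minimal sketch using only the Python standard library. The sample data, the `payments` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a small CSV with stray whitespace and a missing value.
RAW_CSV = """name,amount
 alice ,10.5
bob,
carol,3
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: strip whitespace, drop rows with no amount, cast amount to float."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append((name, float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
print(result)
```

Keeping each step in its own function makes it easy to swap the source or the destination without touching the cleaning logic.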

## Stage 2: petl

petl is a Python package for this kind of task. More information is available [here](https://petl.readthedocs.io/en/latest/intro.html#ipython-notebook-integration).

In [4]:
!pip install petl



In [5]:
#Set up the environment
import petl as etl

In [6]:
#Set up example data
example_data = """foo,bar,baz
a,1,3.4
b,2,7.4
c,6,2.2
d,9,8.1
"""
with open('example.csv', 'w') as f:
    f.write(example_data)

In [8]:
#Set up pipeline
table1 = etl.fromcsv('example.csv')
table2 = etl.convert(table1, 'foo', 'upper')
table3 = etl.convert(table2, 'bar', int)
table4 = etl.convert(table3, 'baz', float)
table5 = etl.addfield(table4, 'quux', lambda row: row.bar * row.baz)

In [9]:
#peek at the first 5 rows
etl.look(table5)

+------+-----+-----+--------------------+
| foo  | bar | baz | quux               |
+======+=====+=====+====================+
| u'A' |   1 | 3.4 |                3.4 |
+------+-----+-----+--------------------+
| u'B' |   2 | 7.4 |               14.8 |
+------+-----+-----+--------------------+
| u'C' |   6 | 2.2 | 13.200000000000001 |
+------+-----+-----+--------------------+
| u'D' |   9 | 8.1 |  72.89999999999999 |
+------+-----+-----+--------------------+

## Stage 3: Other useful tools

### Tokenizing words

The best-known library for this is nltk. Common preprocessing tricks include removing stop words and so on. [Here](https://gist.github.com/ameyavilankar/10347201) is an example.
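To show the idea of tokenizing and removing stop words without any downloads, here is a rough standard-library sketch. It is not nltk: the regular expression is a simple stand-in for a real tokenizer, and `STOP_WORDS`, `tokenize`, and `remove_stop_words` are hypothetical names with a deliberately tiny word list.

```python
import re

# Hypothetical, tiny stop-word list for illustration; nltk ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = tokenize("The quick brown fox is a friend of the lazy dog.")
filtered = remove_stop_words(tokens)
print(filtered)
```

For real text, a library tokenizer handles punctuation, contractions, and non-English input far better than this regular expression.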

### Read/Write files

[Here](http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python) are the basics of reading and writing plain text files. For more specific formats, there is the `csv` module or pandas-style read/write.
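As a small illustration of both cases, the sketch below writes and reads back a text file and a CSV file using only the standard library. The file names and sample rows are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; note that csv reads every value back as a string.
rows = [["foo", "bar"], ["a", "1"], ["b", "2"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Plain text: write two lines, then read them back.
    txt_path = os.path.join(tmp_dir, "notes.txt")
    with open(txt_path, "w") as f:
        f.write("first line\nsecond line\n")
    with open(txt_path) as f:
        lines = f.read().splitlines()

    # CSV: newline="" lets the csv module control line endings itself.
    csv_path = os.path.join(tmp_dir, "table.csv")
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    with open(csv_path, newline="") as f:
        rows_back = list(csv.reader(f))

print(lines)
print(rows_back)
```

Using `with` closes each file automatically, even if an error occurs partway through.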

### Other resources

[Cleaning data in Python](http://data.library.utoronto.ca/cleaning-data-python) by U of Toronto.

[Scraping tweets using Python](https://data.library.utoronto.ca/scraping-tweets-using-python) by U of Toronto.

[Handy Python Libraries for Formatting and Cleaning Data](https://blog.modeanalytics.com/python-data-cleaning-libraries/) by Mode Analytics.

[Doing Data Science: A Kaggle Walkthrough – Cleaning Data](http://www.kdnuggets.com/2016/03/doing-data-science-kaggle-walkthrough-cleaning-data.html/2) by Kaggle/KDnuggets.