underthecurve/pandas-data-cleaning-tricks

Tricks for cleaning your data in Python using pandas

In 2017 I gave a talk called "Tricks for cleaning your data in R" at the Data+Narrative workshop at Boston University. The repo with the code and data I used for the talk was pretty well received, so I figured I'd try to do some of the same things in Python using pandas.

Disclaimer: when it comes to data stuff, I'm much better with R, especially the tidyverse set of packages, than with Python. In my last job, though, I used Python's pandas library for a lot of data processing, since Python was the dominant language there. Please feel free to let me know if there are better ways to do things!

Links to install Python, pandas and Jupyter notebook

  • Python: website for Python
  • pandas: website for the pandas library
  • Jupyter: website for Project Jupyter, whose interactive notebook this tutorial was written in

Files included

Annotated code and step-by-step instructions for the workshop

Python code

Underlying data needed to run the Python code

How to follow this tutorial

  • You can clone or download this repository by clicking on the green button above, "Clone or download"
  • Follow along by reading the .ipynb file online or by printing out the .pdf file; both are available through the GitHub links above

Questions / Feedback?

ychristinezhang at gmail dot com

or on Twitter

@christinezhang

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.