# SI 330: Data Manipulation 
## 01 - Introduction
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Overview of Today
* Teaching team introductions
* Why this course?
* What you’ll learn
* Syllabus walk-through
* Introduction to Data Manipulation
* Introduction to Anaconda and Jupyter Notebooks

# About your instructor: Dr. Chris Teplovs
* Originally from Canada (and currently living there)
* Ph.D. in Curriculum, Teaching and Learning from the University of Toronto
* Postdoctoral Fellow at Copenhagen Business School
* Visiting Associate Research Professor, École Normale Supérieure de Cachan, France
* Lead Developer, Office of Academic Innovation
* Lecturer & Research Scientist, School of Information

# About the teaching team
* GSI: Johan Mosquera
* IA: Frankie Antenucci

# Icebreaker: Two truths and a lie
1. Arrange yourselves into groups of about 5 people.  
1. Think of three statements about yourself. Two must be true statements, and one must be false. 
1. Each person shares their three statements (in any order) with the group. 
1. The goal of the icebreaker game is to determine which statement is false. 


# Why this course?

* About 80% of initial work on a data science project involves **data manipulation**
 * accessing, converting, transforming, cleaning, filtering, aggregating, grouping, summarizing
* Data analysis is tightly coupled with data manipulation, especially when iterating
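
To make a few of these verbs concrete, here is a minimal sketch of filtering, grouping, aggregating, and sorting with pandas. The `rides` table and its values are hypothetical, invented only for illustration; they are not course data.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
rides = pd.DataFrame({
    "city": ["Ann Arbor", "Detroit", "Ann Arbor", "Detroit", "Lansing"],
    "minutes": [12, 30, 7, 45, 20],
})

# Filtering: keep only rides longer than 10 minutes.
long_rides = rides[rides["minutes"] > 10]

# Grouping and aggregating: total minutes per city.
minutes_by_city = long_rides.groupby("city")["minutes"].sum()

# Sorting: cities ordered from most to fewest total minutes.
summary = minutes_by_city.sort_values(ascending=False)
print(summary)
```

Each step produces a new object rather than changing `rides`, which makes it easy to inspect intermediate results while iterating.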

# Skills you'll learn in this course

* How to get / read / gather / fetch / crawl data
* How to convert data to and from important formats
* Basic computation and manipulation of the data, including filtering and sorting
* Initial methods to explore and visualize to gain insights
* How to apply Python coding and packages to solve the above problems
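
As a small taste of format conversion, the sketch below reads CSV text and writes it back out as JSON using only the Python standard library. The CSV content is a hypothetical stand-in for a file you might download; in practice you may well use pandas for this instead.

```python
import csv
import io
import json

# Hypothetical CSV text, standing in for a downloaded file.
csv_text = "name,age\njohn,23\nxin,22\n"

# Read the CSV rows into a list of dictionaries (all values arrive as strings).
rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# Convert the age column from text to integers.
for row in rows:
    row["age"] = int(row["age"])

# Write the same records out as JSON text.
json_text = json.dumps(rows)
print(json_text)
```

Note that CSV carries no type information, so the explicit `int(...)` conversion is what keeps ages numeric in the JSON output.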

# Tools you'll learn in this course

* Python core functionality
* Jupyter notebooks
* Python packages
 * pandas, matplotlib, re, NLTK, pyspark
* Amazon Web Services (AWS)
 * Simple Storage Service (S3)
 * Lambda
 * API Gateway
* Spark

# Course plan: a smorgasbord of data manipulation techniques
![](assets/smorgasbord.png)

# Syllabus walk-through

[Canvas](https://umich.instructure.com/courses/267556)

[Syllabus](https://docs.google.com/document/d/1PcXeEiuVn_0EKH0kn6rWadJAfRPRW1PRB9jyiN52XF8/edit)



# Class format

* meeting face-to-face twice a week
* series of 20 in-class notebooks
* about 5 "segments" per class

# Late policy
* You have 3 penalty-free late days
* One late day = one 24-hour period after due date
* No fractional late days: all or nothing
* 25% penalty per late day after late days used up
* You don't need to explain late days
* We track them for you
* Submit late assignments via Canvas (like usual)
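
The arithmetic behind this policy can be sketched in a few lines of Python. This is only one reading of the bullets above, not an official grading tool: it assumes that penalties are deducted as a fraction of the assignment score, that they add up across days, and that they cap at 100%. The function names are hypothetical; the syllabus is the authority.

```python
import math

FREE_LATE_DAYS = 3       # from the policy above
PENALTY_PER_DAY = 0.25   # from the policy above


def late_days_used(hours_late):
    """Whole late days: any part of a 24-hour period counts as a full day."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def penalty_fraction(hours_late, free_days_remaining=FREE_LATE_DAYS):
    """Fraction of the score deducted.

    ASSUMPTION: penalties are additive per day and capped at 100%.
    """
    penalized_days = max(0, late_days_used(hours_late) - free_days_remaining)
    return min(1.0, penalized_days * PENALTY_PER_DAY)


# 30 hours late counts as 2 late days; with no free days left, that is a 50% penalty.
print(late_days_used(30), penalty_fraction(30, free_days_remaining=0))
```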

# Original work policy
Unless otherwise specified in an assignment, all submitted work must be your own, original work. Any excerpts, statements, or phrases from the work of others must be clearly identified as a quotation, and a proper citation provided. Any violation of the School’s policy on Academic and Professional Integrity (stated in the Master’s and Doctoral Student Handbooks) will result in serious penalties, which might range from failing an assignment, to failing a course, to being expelled from the program. Violations of academic and professional integrity will be reported to UMSI Student Affairs. Consequences impacting assignment or course grades are determined by the faculty instructor; additional sanctions may be imposed by the assistant dean for academic and student affairs.

# Accommodations for students with disabilities
If you think you need an accommodation for a disability, please let me know at your earliest convenience. Some aspects of this course, the assignments, the in-class activities, and the way we teach may be modified to facilitate your participation and progress. As soon as you make me aware of your needs, we can work with the Office of Services for Students with Disabilities (SSD) to help us determine appropriate accommodations. SSD (734-763-3000; ssd.umich.edu/) typically recommends accommodations through a Verified Individualized Services and Accommodations (VISA) form. I will treat any information that you provide in as confidential a manner as possible.

# Student mental health and wellbeing
The University of Michigan is committed to advancing the mental health and wellbeing of its students, while acknowledging that a variety of issues, such as strained relationships, increased anxiety, alcohol/drug problems, and depression, directly impact students' academic performance.
If you or someone you know is feeling overwhelmed, depressed, and/or in need of support, services are available. For help, contact Counseling and Psychological Services (CAPS) at (734) 764-8312 and https://caps.umich.edu/ during and after hours, on weekends and holidays, or through its counselors physically located in schools on both North and Central Campus. You may also consult University Health Service (UHS) at (734) 764-8320 and https://www.uhs.umich.edu/mentalhealthsvcs, or for alcohol or drug concerns, see www.uhs.umich.edu/aodresources.

# Questions?

# Getting set up
* [Canvas](https://umich.instructure.com/courses/267556)
* [Slack](https://si330wn2019.slack.com)
* Jupyter ([Anaconda](https://www.anaconda.com))

## Canvas
* institutional learning management system
* you'll find assignments and grades here

## Slack

* group communication tool
* primary communication tool in this course (instead of email)

## Jupyter and JupyterLab
* What is Jupyter?
> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
* Why Jupyter?
 * Interactive, reproducible results, literate programming, REPL (read-eval-print loop)
 * great for data exploration
* Why JupyterLab?
 * next-generation UI for Jupyter

## Jupyter and Python

* in the beginning: Python
* later: IPython
* still later: Jupyter notebooks
 * not just Python (R, Julia, etc.)
* different from scripting
* great for data analysis
* not great for software engineering (see Joel Grus' ["I don't like notebooks"](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1) presentation)

## Next steps
1. Follow the invitation link to Slack (see Canvas Announcements)
2. Install [Anaconda](https://www.anaconda.com/download/) (if you haven't already)
3. Create a folder for your work in this class (e.g. si330)
4. Start JupyterLab, either from the command line in Terminal (Mac) or PowerShell (Windows), or using Anaconda-Navigator.
5. Download Day1.zip from Canvas -> Files, unzip it, move it to your si330 folder, and open it in **JupyterLab**
6. We'll start working on the lab together, share some insights, submit your first notebook for the class, and talk about prepping for next class


## Learning Objectives
* install and run JupyterLab
* ensure you can use needed libraries
* be able to run a class notebook
* write your first code in this class
* practice submitting an assignment
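
One way to check the "needed libraries" objective before running anything else is sketched below. It assumes the packages of interest are the three this notebook imports (numpy, pandas, matplotlib) and uses the standard library's `importlib.util.find_spec` to look for each one without importing it.

```python
import importlib.util

# Packages this notebook imports (adjust the list for other notebooks).
needed = ["numpy", "pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in needed if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All needed packages are available.")
```

If anything is reported missing, installing it through Anaconda and restarting the kernel is the usual next step.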

### <font color="magenta">Q1: (2 points) What are you looking forward to learning in this class?</font>

Insert your answer here.

### <font color="magenta">Q2: (2 points) What are you most concerned about in this class?</font>

Insert your answer here.

### <font color="magenta">Q3: (3 points) Run the following cells (hint: use Shift-Enter)</font>

In [None]:
import numpy as np

In [None]:
# NOTE: If the above cell gives you an error, uncomment the following line
#       and run this cell (Shift-Enter).  It will take several minutes to finish.
#!conda install -y numpy pandas matplotlib

In [None]:
import pandas as pd

In [None]:
import matplotlib.pyplot as plt

In [None]:
df = pd.DataFrame({
    'name':['john','vj','xin','amanda','sungjin','lisa','jose'],
    'age':[23,78,22,19,45,33,20],
    'num_bikes':[1,2,0,0,3,2,0],
    'num_pets':[0,1,0,3,2,2,3]
})

In [None]:
%matplotlib inline

In [None]:
# a scatter plot comparing num_bikes and num_pets
df.plot(kind='scatter',x='num_bikes',y='num_pets',color='blue')
plt.show()

### <font color="magenta">Q4: (3 points) What does the above plot tell you?</font>

Insert your answer here.

## End of notebook
## Remember to submit this notebook to Canvas in both HTML and ipynb formats.