# Intro to Jupyter Notebook and Other Useful Techniques!

Jupyter Notebook is a useful tool for writing organized Python code. Instead of putting all of your code in one long script, you organize it into chunks called cells. This text (and the heading above) lives in a markdown cell, which can hold notes about your code or long blocks of non-code text! Some people even write entire reports or presentations in Jupyter Notebook, which is great if you want to include a lot of code or run code during a presentation.

Using the dropdown menu between the refresh and keyboard buttons, change the next cell to markdown, type something, and press the "run" button above to run it!

Hello Dr. Rigas

This is some new text! 

Hover over the white buttons starting with the save icon to learn what each one does!

Keyboard Shortcuts
- Ctrl + Enter: run the current cell
- when the cell has a BLUE box around it (click next to the 'In []:' label to select it), press a to add a cell above it or b to add a cell below it

# Pandas and Regular Expressions!

Pandas is a widely used data analytics library, and knowing it will help in your data science/analytics classes and careers. This tutorial will teach you how to load a dataset into a data frame and make use of regular expressions. To learn more about regular expressions, see this link: https://docs.python.org/3/library/re.html

In [1]:
#import libraries
import pandas as pd
import re

We are going to look at customer data from a fictional company. Start by reading the dataset chinook_customers.csv into a dataframe called customers in the cell below. Run the cell once you're done to see the data.
- pd.read_csv('insert file name here')

In [2]:
customers = pd.read_csv('chinook_customers.csv')
customers

Unnamed: 0,CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
0,1,Luís,Gonçalves,Embraer - Empresa Brasileira de Aeronáutica S.A.,"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP,Brazil,12227-000,+55 (12) 3923-5555,+55 (12) 3923-5566,luisg@embraer.com.br,3
1,2,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
2,3,François,Tremblay,,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,+1 (514) 721-4711,,ftremblay@gmail.com,3
3,4,Bjørn,Hansen,,Ullevålsveien 14,Oslo,,Norway,171,+47 22 44 22 22,,bjorn.hansen@yahoo.no,4
4,5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw%jetbrains.com,4
5,6,Helena,Holý,,Rilská 3174/6,Prague,,Czech Republic,14300,+420 2 4177 0449,,hholy@gmail.com,5
6,7,Astrid,Gruber,,"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,,Austria,1010,+43 01 5134505,,astrid.gruber@apple.at,5
7,8,Daan,Peeters,,Grétrystraat 63,Brussels,,Belgium,1000,+32 02 219 03 03,,daan_peeters@apple.be,4
8,9,Kara,Nielsen,,Sønder Boulevard 51,Copenhagen,,Denmark,1720,+453 3331 9991,,kara.nielsen@jubii.dk,4
9,10,Eduardo,Martins,Woodstock Discos,"Rua Dr. Falcão Filho, 155",São Paulo,SP,Brazil,01007-010,+55 (11) 3033-5446,+55 (11) 3033-4564,eduardo@woodstock.com.br,4
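
A side note on the `Unnamed: 0` column in the output above: pandas typically produces that name when the first column of the CSV has an empty header, which usually means a row index was saved along with the data. The sketch below is an optional variant, not part of the exercise. It assumes a small in-memory sample (hypothetical, standing in for chinook_customers.csv) and uses the `index_col` parameter of `pd.read_csv` to treat that first column as the index.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for chinook_customers.csv;
# the empty first header mimics a saved index column.
sample_csv = io.StringIO(
    ",CustomerId,FirstName,Email\n"
    "0,1,Luís,luisg@embraer.com.br\n"
    "1,2,Leonie,leonekohler@surfeu.de\n"
    "2,3,François,ftremblay@gmail.com\n"
)

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a data column named "Unnamed: 0".
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.columns.tolist())
print(sample.shape)
```

The rest of this tutorial keeps the plain `pd.read_csv('chinook_customers.csv')` call, so its outputs still show the `Unnamed: 0` column.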


With the data loaded, there are a lot of things you can do with pandas. It is the foundation for many of the data cleaning and machine learning programming tasks in the data science world. For the purpose of this exercise, we are going to use regular expressions to validate email formats and phone numbers!

# Regular Expressions: Emails

Often when we are working with data, we want to search for a particular pattern in the data; phone numbers and email addresses, for example, follow common patterns. This task of searching and extracting is so common that Python ships with a very powerful regular expression module called re that handles many of these tasks quite elegantly. The syntax is a little odd, but once you get used to it, you will see how powerful regular expressions are and how much easier they can make your data-managing life.
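
As a first taste, here is a minimal sketch of searching a string for email-like text. The sample sentence and the deliberately loose pattern are assumptions made up for illustration; they are not the pattern used later in this tutorial.

```python
import re

# Hypothetical sample text, not taken from the customer dataset.
text = "Contact ana@example.com or bo.li@example.org for details."

# A deliberately loose pattern: non-space characters, "@", more
# non-space characters, a literal dot, then word characters.
email_pattern = re.compile(r"\S+@\S+\.\w+")

# search() returns the first match (or None); findall() returns all matches.
first = email_pattern.search(text)
all_found = email_pattern.findall(text)

print(first.group())
print(all_found)
```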

Entire books have been written on the topic of regular expressions. A relatively simple tutorial on RegEx in Python can be found here:
https://www.w3schools.com/python/python_regex.asp

For more detail on regular expressions, see:
https://docs.python.org/library/re.html

Let's first look at emails. In this example the search pattern has already been created for you. See if you can understand it.

- The goal: make sure the emails are all in a standard format of "text/numbers@text/numbers.text"
- Step 1 (completed for you, run the cell): create a true/false email format column - this column will say True if the email format is valid or False if it's not
- Step 2: write the regular expression and populate the valid email column
- Step 3 (completed for you, run the cell): return the rows where the true/false column is false - you should see 4 invalid emails!


In [3]:
#Step 1 - insert an empty (NaN) column called ValidEmail next to the Email column
#(the insert below is commented out, so Step 2 creates the column at the end instead)
import numpy as np
#customers.insert(loc=12, column='ValidEmail', value=np.nan)

In [4]:
#Step 2 - regular expression to detect emails - try to understand the regex pattern
regex = re.compile(r"\w\S*@.*\w\S\.\w\S")
#note: the column holds the strings 'True'/'False', not booleans
customers['ValidEmail'] = customers['Email'].apply(lambda x: 'True' if regex.match(x) else 'False')
customers

Unnamed: 0,CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId,ValidEmail
0,1,Luís,Gonçalves,Embraer - Empresa Brasileira de Aeronáutica S.A.,"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP,Brazil,12227-000,+55 (12) 3923-5555,+55 (12) 3923-5566,luisg@embraer.com.br,3,True
1,2,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5,True
2,3,François,Tremblay,,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,+1 (514) 721-4711,,ftremblay@gmail.com,3,True
3,4,Bjørn,Hansen,,Ullevålsveien 14,Oslo,,Norway,171,+47 22 44 22 22,,bjorn.hansen@yahoo.no,4,True
4,5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw%jetbrains.com,4,False
5,6,Helena,Holý,,Rilská 3174/6,Prague,,Czech Republic,14300,+420 2 4177 0449,,hholy@gmail.com,5,True
6,7,Astrid,Gruber,,"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,,Austria,1010,+43 01 5134505,,astrid.gruber@apple.at,5,True
7,8,Daan,Peeters,,Grétrystraat 63,Brussels,,Belgium,1000,+32 02 219 03 03,,daan_peeters@apple.be,4,True
8,9,Kara,Nielsen,,Sønder Boulevard 51,Copenhagen,,Denmark,1720,+453 3331 9991,,kara.nielsen@jubii.dk,4,True
9,10,Eduardo,Martins,Woodstock Discos,"Rua Dr. Falcão Filho, 155",São Paulo,SP,Brazil,01007-010,+55 (11) 3033-5446,+55 (11) 3033-4564,eduardo@woodstock.com.br,4,True
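
If the Step 2 pattern is hard to read, one way to take it apart is the `re.VERBOSE` flag, which lets a pattern span several lines with a comment on each piece. The sketch below rewrites the same pattern that way and checks it against three addresses copied from the data; the variable names are hypothetical.

```python
import re

# The pattern from Step 2, unchanged.
compact = re.compile(r"\w\S*@.*\w\S\.\w\S")

# The same pattern written with re.VERBOSE so each piece can carry a comment.
verbose = re.compile(
    r"""
    \w      # one word character (letter, digit, or underscore)
    \S*     # zero or more non-whitespace characters
    @       # a literal at sign
    .*      # any characters (greedy)
    \w\S    # a word character followed by a non-whitespace character
    \.      # a literal dot
    \w\S    # a word character followed by a non-whitespace character
    """,
    re.VERBOSE,
)

# Addresses copied from the customer data.
emails = ["luisg@embraer.com.br", "frantisekw%jetbrains.com", "vstevens@yahoo/com"]
for email in emails:
    print(email, bool(compact.match(email)), bool(verbose.match(email)))
```

Note that `match()` only anchors at the start of the string, so this pattern checks that an address begins with a plausible shape rather than validating it fully.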


In [5]:
#Step 3 - return invalid email rows!
customers[customers['ValidEmail'] == 'False']

Unnamed: 0,CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId,ValidEmail
4,5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw%jetbrains.com,4,False
24,25,Victor,Stevens,,319 N. Frances Street,Madison,WI,USA,53703,+1 (608) 257-0597,,vstevens@yahoo/com,5,False
51,52,Emma,Jones,,202 Hoxton Street,London,,United Kingdom,N1 5LH,+44 020 7707 0707,,emma_jones=hotmail.com,3,False
58,59,Puja,Srivastava,,"3,Raj Bhavan Road",Bangalore,,India,560001,+91 080 22289999,,puja_srivastava++yahoo.in,3,False
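
The cells above store the strings 'True' and 'False'. A possible alternative, sketched here on a hypothetical three-row frame with addresses copied from the data, is to store real booleans using the pandas `Series.str.match` method and then filter with a negated mask.

```python
import pandas as pd

# Hypothetical mini-frame; the addresses are copied from the customer data.
sample = pd.DataFrame(
    {"Email": ["luisg@embraer.com.br", "vstevens@yahoo/com", "emma_jones=hotmail.com"]}
)

# Series.str.match applies re.match to each value and returns booleans.
sample["ValidEmail"] = sample["Email"].str.match(r"\w\S*@.*\w\S\.\w\S")

# "~" negates the boolean mask, keeping only the invalid rows.
invalid = sample[~sample["ValidEmail"]]
print(invalid)
```

With a boolean column there is no need to compare against the text 'False', and the column can be counted or summed directly.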


# Regular Expressions: Phone Numbers


Now we are going to set up a VERY simple data set with phone numbers.
Since the phone number formats in this data vary, you will need to write a regular expression for each of the 3 formats in the string below.

In [6]:
#RUN ME!
phonenumbers = '1234567890, 123-456-7890, (123) 456-7890'

In [24]:
#Find the first phone number pattern 1234567890 - if done properly, running this cell prints only that phone number
pattern1 = re.findall(r"\d{10}", phonenumbers) 
for number in pattern1:
    print(number)

1234567890


In [23]:
#Find the second phone number pattern 123-456-7890
pattern2 = re.findall(r"\d{3}\S\d{3}\S\d{4}", phonenumbers) 
for number in pattern2:
    print(number)

123-456-7890
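
One thing to notice about the second pattern: `\S` matches any non-whitespace character, not just a hyphen. The sketch below compares it with a stricter variant that spells out the hyphen. The test string is hypothetical and separate from `phonenumbers`.

```python
import re

# Hypothetical strings; only the first uses hyphens as separators.
candidates = "123-456-7890, 123x456y7890, 123.456.7890"

# \S accepts any non-whitespace separator.
loose = re.findall(r"\d{3}\S\d{3}\S\d{4}", candidates)

# A literal "-" accepts only hyphen-separated numbers.
strict = re.findall(r"\d{3}-\d{3}-\d{4}", candidates)

print(loose)
print(strict)
```

Both patterns give the same result on the `phonenumbers` string from this exercise; the difference only shows up on messier data.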


In [22]:
#See if you can complete the third phone number pattern (123) 456-7890
pattern3 = re.findall(r"\(\d{3}\)\s\d{3}\S\d{4}", phonenumbers) 
for number in pattern3:
    print(number)

(123) 456-7890
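
As an optional extension, the three patterns can be combined into one with alternation (`|`). This is a sketch rather than part of the exercise; the name `combined` is hypothetical, and the hyphens are written literally instead of using `\S`.

```python
import re

phonenumbers = '1234567890, 123-456-7890, (123) 456-7890'

# Alternation with "|" tries each format in turn.
# (?:...) groups without capturing, so findall returns whole matches.
combined = re.compile(
    r"(?:\(\d{3}\)\s\d{3}-\d{4}"   # (123) 456-7890
    r"|\d{3}-\d{3}-\d{4}"          # 123-456-7890
    r"|\d{10})"                    # 1234567890
)

matches = combined.findall(phonenumbers)
for number in matches:
    print(number)
```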
