# Semester Part 3: Data Science

Selection of topics
 - Pandas module
 - Requests module : (HTTP GET / POST, also APIs but we won't cover that)
 - BeautifulSoup module : HTML parsing
 - SQLite : databases, hook into pandas
 - Matplotlib module : Figures & graphs
 
Why these topics?
 - To the best of our knowledge, these are the tools you are most likely to rely on
 - If you know other tools that are coming into common use, let us know!

How can I get the most out of this part of the course?
 - Be aware that you're likely going to specialize and use some of these tools a LOT, others not much or not at all
 - Treat these lectures and these notes as the first draft of your own reference material

# Outline for Wednesday, March 31
## Pandas 1 - Series

Core ideas:
 - Pandas helps you work with tabular data (tables)
 - A list of lists is not good enough to match what spreadsheets can do
 - Series: new data structure
     - hybrid of a dict and a list
     - Python dict "key" equivalent to "index" in pandas
     - Python list "index" equivalent to "integer position" in pandas
     - supports complicated expressions within lookup [...]
     - supports element-wise operations
     - supports boolean indexing
 - DataFrames aka tables (topic for Monday)
     - built from series
     - each series will be a column in the table

**Try importing pandas. If you don't already have it, run this command in your shell:**

pip install pandas

## Module naming abbreviation

Many pandas users abbreviate it as "pd".

In [30]:
import pandas as pd

In [33]:
pd.Series

pandas.core.series.Series

## Create a series from a dict

In [34]:
d = {"one":7, "two":8, "three":9}
d

{'one': 7, 'two': 8, 'three': 9}

In [35]:
s = pd.Series(d)
s

one      7
two      8
three    9
dtype: int64

In [None]:
# IP  index    value
# 0   "one"    7
# 1   "two"    8
# 2   "three"  9

## Accessing values with index (.loc[...])

In [2]:
# dict access with key
d["one"]

7

In [36]:
s.loc["one"]

7

In [37]:
s.loc["three"]

9

## Accessing values with integer position (.iloc[...])

In [38]:
s.iloc[0]

7

In [39]:
s.iloc[2]

9

In [40]:
s.iloc[-1]

9

In [41]:
s["one"]  # With plain brackets, pandas guesses whether we mean .loc or .iloc

7

In [42]:
s[1]

8
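Because plain brackets leave pandas to guess, spelling out `.loc` or `.iloc` is usually the safer habit. As far as we know, newer pandas releases also warn when an integer such as `s[1]` is used as a position on a Series whose index is not made of integers. The sketch below reuses the same three-element Series; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series({"one": 7, "two": 8, "three": 9})

# Explicit label-based lookup: "two" is an index label
by_label = s.loc["two"]

# Explicit position-based lookup: 1 is the second integer position
by_position = s.iloc[1]

print(by_label, by_position)
```

Both lookups reach the same element here, but only because the label "two" happens to sit at integer position 1.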

## Accessing multiple values with a list of integer positions or index labels

In [43]:
s[[0,2]]

one      7
three    9
dtype: int64

In [45]:
s[["one","three"]]

one      7
three    9
dtype: int64
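The same list-based lookups can be written with explicit `.iloc` and `.loc`, which removes the guessing. This is a minimal sketch on the Series from above; `by_positions` and `by_labels` are illustrative names.

```python
import pandas as pd

s = pd.Series({"one": 7, "two": 8, "three": 9})

# A list inside .iloc selects several integer positions at once
by_positions = s.iloc[[0, 2]]

# A list inside .loc selects several index labels at once
by_labels = s.loc[["one", "three"]]

# Both selections produce the same two-element Series
print(by_positions.equals(by_labels))
```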

## Create a series from a list

In [46]:
num_list = [100, 200, 300]
s = pd.Series(num_list)
s

0    100
1    200
2    300
dtype: int64

In [None]:
# IP  index  value
# 0   0      100
# 1   1      200
# 2   2      300

In [48]:
print(s.loc[1])
print(s.iloc[1])

200
200


In [51]:
letters_list = ["A","B","d","E"]
letters = pd.Series(letters_list) #Note the dtype: object (not str!)
letters.iloc[-1]

'E'

## Slicing series using integer positions

In [52]:
letters_list = ["A", "B", "C", "D"]
letters = pd.Series(letters_list)

In [5]:
letters_list

['A', 'B', 'C', 'D']

In [6]:
sliced_letter_list = letters_list[2:]
sliced_letter_list

['C', 'D']

In [7]:
sliced_letter_list[0]

'C'

In [53]:
letters

0    A
1    B
2    C
3    D
dtype: object

In [57]:
sliced_letters = letters[2:]
sliced_letters.iloc[:1] #Remember the integer positions get renumbered because it's a new Series!

2    C
dtype: object

In [None]:
# sliced_letters

# IP  index  value
# 0   2       C
# 1   3       D

In [58]:
sliced_letters[2]  # The index is made of integers, so 2 is treated as an index label, not an integer position

'C'
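To make the difference between labels and positions in a sliced Series concrete, here is a small sketch. It also shows `reset_index(drop=True)`, one way to renumber the index so that labels and positions agree again; whether you want that depends on whether the original labels still matter to you.

```python
import pandas as pd

letters = pd.Series(["A", "B", "C", "D"])
sliced = letters.iloc[2:]

# The slice keeps the original index labels 2 and 3
print(list(sliced.index))

first_by_position = sliced.iloc[0]  # integer position 0 of the new Series
first_by_label = sliced.loc[2]      # index label 2, carried over from letters

# Throw away the old labels and renumber from 0
renumbered = sliced.reset_index(drop=True)
print(renumbered)
```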

## Slicing series using index

In [59]:
s = pd.Series({"one":7, "two":8, "three":9})
s

one      7
two      8
three    9
dtype: int64

In [61]:
s["two":]

two      8
three    9
dtype: int64
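One detail worth writing down in your own notes: a slice by index label includes its end label, while a slice by integer position excludes its end position, just like a list. A minimal sketch on the same Series:

```python
import pandas as pd

s = pd.Series({"one": 7, "two": 8, "three": 9})

# Label slice: both "one" and "two" are included
label_slice = s.loc["one":"two"]

# Position slice: position 1 is excluded, as with list slicing
position_slice = s.iloc[0:1]

print(label_slice)
print(position_slice)
```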

## Element-wise operations
1. SERIES op SCALAR
2. SERIES op SERIES

Rule of thumb in pandas: NO FOR LOOPS

In [62]:
nums = [1, 2, 3]
nums * 3

[1, 2, 3, 1, 2, 3, 1, 2, 3]

In [63]:
nums + 3

TypeError: can only concatenate list (not "int") to list

In [64]:
nums / 3

TypeError: unsupported operand type(s) for /: 'list' and 'int'

In [66]:
snums = pd.Series(nums)
snums

0    1
1    2
2    3
dtype: int64

In [67]:
snums * 3

0    3
1    6
2    9
dtype: int64

In [68]:
snums + 3

0    4
1    5
2    6
dtype: int64

In [69]:
snums / 3

0    0.333333
1    0.666667
2    1.000000
dtype: float64

In [71]:
snums += 2
snums

0    5
1    6
2    7
dtype: int64

In [72]:
#list recap
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l1 + l2

[1, 2, 3, 4, 5, 6]

In [73]:
s1 = pd.Series(l1)
s2 = pd.Series(l2)
s1 + s2

0    5
1    7
2    9
dtype: int64

In [74]:
s1 * s2

0     4
1    10
2    18
dtype: int64

In [75]:
s1 ** s2

0      1
1     32
2    729
dtype: int64

In [76]:
s1 / s2

0    0.25
1    0.40
2    0.50
dtype: float64

In [77]:
s1 < s2

0    True
1    True
2    True
dtype: bool
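To see what the "no for loops" rule buys us, the sketch below computes the same result twice: once with a list comprehension (a loop in disguise) and once with an element-wise Series operation. The variable names are illustrative.

```python
import pandas as pd

nums = [1, 2, 3]

# Plain Python: we have to visit each element ourselves
looped = [n * 3 for n in nums]

# pandas: the operation is applied to every element for us
vectorized = pd.Series(nums) * 3

print(looped)
print(vectorized.tolist())
```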

## What happens to element-wise operations if the two series have different sizes?

In [78]:
pd.Series([1, 2, 3]) + pd.Series([4, 5])
# NaN stands for "Not a Number" and is used when we have missing data

0    5.0
1    7.0
2    NaN
dtype: float64
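If a missing partner should count as zero instead of producing NaN, one option is the `Series.add` method with its `fill_value` parameter. This is a sketch of that option, not the only way to handle missing data; `isna()` is shown as a way to locate the NaN entries.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5])

# The + operator leaves NaN where b has no value at index 2
plain = a + b
print(plain.isna())

# add() with fill_value=0 treats the missing entry in b as 0
filled = a.add(b, fill_value=0)
print(filled)
```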

## Series with different types

In [79]:
L = ["a", "Alice", True, 1, 4.5, [1,2], {"a":"Alice"}]
L

['a', 'Alice', True, 1, 4.5, [1, 2], {'a': 'Alice'}]

In [80]:
pd.Series(L)

0                 a
1             Alice
2              True
3                 1
4               4.5
5            [1, 2]
6    {'a': 'Alice'}
dtype: object
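The `dtype` of a Series depends on what it holds. As a small sketch: a mix of ints and floats is stored as `float64`, while a mix of strings and numbers falls back to the catch-all `object` dtype seen above.

```python
import pandas as pd

# Ints and floats together are all stored as floats
numbers = pd.Series([1, 4.5])
print(numbers.dtype)

# Strings mixed with numbers have no common numeric type
mixed = pd.Series(["a", 1, 4.5])
print(mixed.dtype)
```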

## How do you merge two series?

In [81]:
s1 = pd.Series([1, 2, 3]) 
s2 = pd.Series([4, 5])
print(s1)
print(s2)

0    1
1    2
2    3
dtype: int64
0    4
1    5
dtype: int64


In [82]:
s = pd.concat([s1,s2])
s

0    1
1    2
2    3
0    4
1    5
dtype: int64

In [83]:
s.iloc[3]

4

In [84]:
s.loc[0]

0    1
0    4
dtype: int64
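The duplicate index labels above are easy to trip over, since `s.loc[0]` now returns two values. If the original labels are not meaningful, one option is to pass `ignore_index=True` to `pd.concat`, which renumbers the result from 0. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])

# ignore_index=True discards the old labels and assigns 0, 1, 2, ...
combined = pd.concat([s1, s2], ignore_index=True)
print(combined)

# Every label is now unique, so .loc returns a single value
print(combined.loc[3])
```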

## Element-wise Ambiguity

In [85]:
s1 = pd.Series({"A":10, "B": 20 })
s2 = pd.Series({"B":1, "A": 2 })
print(s1)
print(s2)

A    10
B    20
dtype: int64
B    1
A    2
dtype: int64


In [86]:
s1 + s2  # INDEX ALIGNMENT - element-wise operations pair up values by matching index label, not by integer position

A    12
B    21
dtype: int64
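If you really do want to combine values by integer position and ignore the labels of the second Series, one way is to strip its index first with `to_numpy()`. The sketch below contrasts the two behaviors; `aligned` and `positional` are illustrative names.

```python
import pandas as pd

s1 = pd.Series({"A": 10, "B": 20})
s2 = pd.Series({"B": 1, "A": 2})

# Default: values are paired by index label (A with A, B with B)
aligned = s1 + s2

# to_numpy() drops s2's labels, so values are paired by position
positional = s1 + s2.to_numpy()

print(aligned)
print(positional)
```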

## How to insert an index-value pair?

In [88]:
s = pd.Series({"A":10, "B": 20 })
print(s)
s["Z"] = 100
s

A    10
B    20
dtype: int64


A     10
B     20
Z    100
dtype: int64

## Boolean indexing

In [89]:
s = pd.Series([10, 2, 3, 15])
s

0    10
1     2
2     3
3    15
dtype: int64

## How to extract numbers > 8?

In [90]:
b = pd.Series([True, False, False, True])
b

0     True
1    False
2    False
3     True
dtype: bool

In [91]:
s[b]

0    10
3    15
dtype: int64

In [92]:
s

0    10
1     2
2     3
3    15
dtype: int64

In [93]:
c = s > 8
c

0     True
1    False
2    False
3     True
dtype: bool

In [94]:
s[c]

0    10
3    15
dtype: int64

In [95]:
s[s > 8] #This is the goal!

0    10
3    15
dtype: int64

## Element-wise String operations

In [96]:
words = pd.Series(["APPLE", "boy", "CAT", "dog"])
words

0    APPLE
1      boy
2      CAT
3      dog
dtype: object

In [97]:
words.upper()

AttributeError: 'Series' object has no attribute 'upper'

In [98]:
words.str.upper()

0    APPLE
1      BOY
2      CAT
3      DOG
dtype: object

In [99]:
b = words == words.str.upper()
b

0     True
1    False
2     True
3    False
dtype: bool

In [100]:
words[b]

0    APPLE
2      CAT
dtype: object

In [101]:
words[words == words.str.upper()]

0    APPLE
2      CAT
dtype: object
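The `.str` accessor offers many more string methods than `upper()`. As a sketch, `str.lower()` and `str.len()` can be combined with boolean indexing in the same way:

```python
import pandas as pd

words = pd.Series(["APPLE", "boy", "CAT", "dog"])

# Length of each word, computed element-wise
lengths = words.str.len()

# Keep only the words that are already all lowercase
lowercase_words = words[words == words.str.lower()]

# Keep only the words longer than three characters
long_words = words[words.str.len() > 3]

print(lengths.tolist())
print(lowercase_words.tolist())
print(long_words.tolist())
```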

## How to get the odd numbers from a series?

In [102]:
s = pd.Series([10, 19, 11, 30, 35])
s

0    10
1    19
2    11
3    30
4    35
dtype: int64

In [103]:
s % 2

0    0
1    1
2    1
3    0
4    1
dtype: int64

In [104]:
b = s % 2 == 1
b

0    False
1     True
2     True
3    False
4     True
dtype: bool

In [105]:
s[b]

1    19
2    11
4    35
dtype: int64

In [106]:
s[s%2==1]

1    19
2    11
4    35
dtype: int64

## BOOLEAN OPERATORS on series: and, or, not 

## How to get numbers < 12 or numbers > 33?

In [107]:
s

0    10
1    19
2    11
3    30
4    35
dtype: int64

In [110]:
s[s < 12 or s > 33]  # This won't work: "or", "and" and "not" expect a single True/False, not a Series of booleans

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [111]:
s[s<12 | s>33]  # "|" replaces "or", but it binds more tightly than < and >, so this still fails without parentheses

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [114]:
s[(s<12) | (s>33)]

0    10
2    11
4    35
dtype: int64

In [115]:
s[(s>=12) & (s<=33)] # and is replaced by "&"

1    19
3    30
dtype: int64

In [117]:
s[~((s>=12) & (s<=33))] #not is replaced by "~"

0    10
2    11
4    35
dtype: int64
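The last two cells select the same rows in two different ways. The sketch below checks that the two boolean masks really are identical, and shows `Series.between` as an alternative spelling of a range check (it includes both endpoints by default). The mask names are illustrative.

```python
import pandas as pd

s = pd.Series([10, 19, 11, 30, 35])

# Each comparison is wrapped in parentheses before combining
inside = (s >= 12) & (s <= 33)
outside = (s < 12) | (s > 33)

# "not inside" is the same mask as "outside"
same_mask = (~inside).equals(outside)
print(same_mask)

# between(12, 33) means 12 <= value <= 33
via_between = s[s.between(12, 33)]
print(via_between)
```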