# Intro to Pandas
by Ryan Orsinger

## Module 1: Intro to pandas series
- Creating series from Python collections
- Doing math on series of numbers
- Describing our data
- Filtering series
- Operating on series of strings
- Using built-in series attributes and methods

## What is pandas?
- The leading data analysis library for Python
- Built for acquiring, cleaning, organizing, and analyzing data
- Series and DataFrames are the core pandas data structures

## Why pandas?
- Pandas is the *de facto* library for data analysis in the Python ecosystem
- Pandas is also *fast*: its vectorized operations are typically much faster than looping over values in plain Python
- Enables accomplishing more with less code

### Pandas Series Part 1
- Creating series objects
- Assigning series
- Doing math on series
- Describing a series

In [1]:
import pandas as pd

In [2]:
pd.Series([7, 8, 9])

0    7
1    8
2    9
dtype: int64

In [3]:
# Assigning a series to a variable 
results = pd.Series([True, False, True])
results

0     True
1    False
2     True
dtype: bool

In [4]:
# A series can hold values of any Python data type
colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
colors = pd.Series(colors)
colors

0       red
1    orange
2    yellow
3     green
4      blue
5    indigo
6    violet
dtype: object

In [5]:
# We can pass a range to pd.Series to make a series of numbers
numbers = pd.Series(range(-3, 3))
numbers

0   -3
1   -2
2   -1
3    0
4    1
5    2
dtype: int64

In [6]:
# We can do arithmetic on entire series with our math operators
numbers + 1

0   -2
1   -1
2    0
3    1
4    2
5    3
dtype: int64

In [7]:
# Series arithmetic follows Python's operator precedence (PEMDAS)
numbers * 2 + 5

0   -1
1    1
2    3
3    5
4    7
5    9
dtype: int64
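
To make the order of operations concrete, here is a small sketch comparing the same expression with and without parentheses. The variable names (`nums`, `without_parens`, `with_parens`) are chosen for illustration and are not part of the lesson.

```python
import pandas as pd

# Illustrative series with the same values as the lesson's "numbers"
nums = pd.Series(range(-3, 3))

# Multiplication binds tighter than addition, so this is (nums * 2) + 5
without_parens = nums * 2 + 5

# Parentheses force the addition to happen first: nums * 7
with_parens = nums * (2 + 5)

print(without_parens.tolist())  # [-1, 1, 3, 5, 7, 9]
print(with_parens.tolist())     # [-21, -14, -7, 0, 7, 14]
```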

In [8]:
# Notice how Python's built-in operators work on the entire series
numbers ** 2

0    9
1    4
2    1
3    0
4    1
5    4
dtype: int64

In [9]:
# We can take the square root by raising to the 1/2 power
# Negative numbers have no real square root, so pandas returns NaN for them
numbers ** (1/2)

0         NaN
1         NaN
2         NaN
3    0.000000
4    1.000000
5    1.414214
dtype: float64
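
The `NaN` ("not a number") entries above mark results that could not be computed. As a minimal sketch, assuming we only want to know how many values are missing, `.isna()` and `.count()` can summarize them. The names `nums` and `roots` are illustrative.

```python
import pandas as pd

# Illustrative series with the same values as the lesson's "numbers"
nums = pd.Series(range(-3, 3))
roots = nums ** (1/2)

# .isna() flags each missing value; summing the flags counts them
missing = roots.isna().sum()

# .count() counts only the non-missing values
present = roots.count()

print(missing, present)  # 3 3
```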

In [10]:
# Notice that arithmetic does not change the original series
numbers

0   -3
1   -2
2   -1
3    0
4    1
5    2
dtype: int64

In [11]:
# Assigning the result of an operation to a new variable
triple = numbers * 3
triple

0   -9
1   -6
2   -3
3    0
4    3
5    6
dtype: int64

In [12]:
prices = pd.Series([1.30, 2.50, 2.50, 5.60, 10.10])
prices

0     1.3
1     2.5
2     2.5
3     5.6
4    10.1
dtype: float64

In [13]:
# Reassigning a variable to overwrite the values with the result of an operation
prices = prices * .8
prices

0    1.04
1    2.00
2    2.00
3    4.48
4    8.08
dtype: float64

In [14]:
s = pd.Series([7, 8, 8, 9, 9, 9])
s

0    7
1    8
2    8
3    9
4    9
5    9
dtype: int64

In [15]:
# The .index attribute returns the index (the row labels) of the series
# Zero-based integer indexes are the default
# Pandas can also use strings and dates as index values
s.index

RangeIndex(start=0, stop=6, step=1)
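
Since the index does not have to be integers, here is a short sketch of a series with string labels. The fruit data is made up for illustration. The `index=` parameter of `pd.Series` sets the labels.

```python
import pandas as pd

# Hypothetical data: fruit names as index labels, counts as values
inventory = pd.Series([3, 5, 2], index=["apple", "banana", "cherry"])

# Look up a value by its label instead of its position
banana_count = inventory["banana"]

print(list(inventory.index))  # ['apple', 'banana', 'cherry']
print(banana_count)           # 5
```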

In [16]:
# The .dtype attribute returns the data type of the values in the series
s.dtype

dtype('int64')

In [17]:
# The .values attribute returns only the values, without the index, as a NumPy array
s.values

array([7, 8, 8, 9, 9, 9])

In [18]:
# On a series, .shape returns a one-element tuple holding the number of elements
# On a dataframe, .shape returns a tuple of (rows, columns)
s.shape

(6,)
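
As a quick illustration of the two cases, the sketch below compares `.shape` on a series and on a small dataframe. The dataframe and its column names are invented for this example.

```python
import pandas as pd

values = pd.Series([7, 8, 8, 9, 9, 9])

# For a series, these three all report the number of elements
series_shape = values.shape   # one-element tuple
series_len = len(values)      # plain integer
series_size = values.size     # plain integer

# Hypothetical dataframe with 3 rows and 2 columns
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
frame_shape = frame.shape

print(series_shape, series_len, series_size)  # (6,) 6 6
print(frame_shape)                            # (3, 2)
```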

In [19]:
# .value_counts returns how often each distinct value occurs, most frequent first
# The index holds the distinct values; the values are their counts
s.value_counts()

9    3
8    2
7    1
dtype: int64
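
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch, using the same values as above under the illustrative names `values` and `shares`:

```python
import pandas as pd

values = pd.Series([7, 8, 8, 9, 9, 9])

# normalize=True returns each value's share of the total instead of its count
shares = values.value_counts(normalize=True)

# 9 appears 3 times out of 6
print(shares.loc[9])  # 0.5
```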

In [20]:
# Mode is the most frequently occurring value in a dataset
s.mode()

0    9
dtype: int64
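
Notice that `.mode()` returns a series rather than a single number. One reason is that several values can tie for most frequent. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data where 7 and 9 both occur twice
tied = pd.Series([7, 7, 9, 9, 8])

# Both tied values come back, in sorted order
modes = tied.mode()

print(modes.tolist())  # [7, 9]
```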

In [21]:
# Median is the middle value of the sorted data
# With an even number of values, it is the average of the two middle values
s.median()

8.5

In [22]:
# .mean returns the arithmetic average
s.mean()

8.333333333333334

In [23]:
# Standard deviation is a measure of spread
# By default, pandas computes the sample standard deviation (ddof=1)
s.std()

0.816496580927726
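
To see where that number comes from, here is a sketch that rebuilds the sample standard deviation by hand with series arithmetic and compares it to `.std()`. The intermediate names are illustrative.

```python
import pandas as pd

values = pd.Series([7, 8, 8, 9, 9, 9])

# Distance of each value from the mean
deviations = values - values.mean()

# Sample variance divides the summed squares by n - 1
sample_variance = (deviations ** 2).sum() / (len(values) - 1)
manual_std = sample_variance ** 0.5

# ddof=0 divides by n instead, giving the population standard deviation
population_std = values.std(ddof=0)

print(manual_std)
print(values.std())
print(population_std)
```

The first two printed values should agree up to floating-point rounding, while the population version comes out slightly smaller.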

In [24]:
# .min returns the lowest value
s.min()

7

In [25]:
# .argmin returns the integer position of the lowest value
s.argmin()

0

In [26]:
# .max returns the highest value
s.max()

9

In [27]:
# .argmax returns the integer position of the first occurrence of the highest value
s.argmax()

3
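
With the default zero-based index, position and label are the same number. The sketch below, which assumes a recent pandas version (1.0 or later) and uses made-up labels, shows how `.argmax()` and `.idxmax()` differ once the index holds strings.

```python
import pandas as pd

# Hypothetical scores labeled with strings
scores = pd.Series([7, 9, 9], index=["a", "b", "c"])

# .argmax gives the integer position of the first maximum
position = scores.argmax()

# .idxmax gives the index label of the first maximum
label = scores.idxmax()

print(position, label)  # 1 b
```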

In [28]:
s

0    7
1    8
2    8
3    9
4    9
5    9
dtype: int64

In [29]:
# The .describe method outputs some helpful descriptive statistics
s.describe()

count    6.000000
mean     8.333333
std      0.816497
min      7.000000
25%      8.000000
50%      8.500000
75%      9.000000
max      9.000000
dtype: float64
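
The `25%`, `50%`, and `75%` rows are quantiles. As a minimal sketch, the same numbers can be requested directly with `.quantile()`, using the illustrative names `values` and `quartiles`:

```python
import pandas as pd

values = pd.Series([7, 8, 8, 9, 9, 9])

# Ask for the three quartiles that .describe reports
quartiles = values.quantile([0.25, 0.5, 0.75])

print(quartiles.tolist())  # [8.0, 8.5, 9.0]
```

The middle quartile is the median, so it matches `values.median()`.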

### Exercise check-in, part 1 of 3
- Create a series named `a` that is the numbers `[1, 2, 3, 4, 5]`
- Create a series named `b` that is the numbers `[1, 1, 2, 3, 5]`
- Square `a` and reassign to the variable `a`
- Square `b` and reassign to the variable `b`
- Add the squares `a` and `b`. Assign to a variable named `sum_of_squares`
- Now take the square root of that sum (*hint*: raising to the 0.5 power takes the square root)

In [30]:
# Create a series named "a" and assign it the numbers [1, 2, 3, 4, 5]
a = pd.Series([1, 2, 3, 4, 5])
a

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [31]:
# Create a series named "b" and assign it the numbers [1, 1, 2, 3, 5]
b = pd.Series([1, 1, 2, 3, 5])
b

0    1
1    1
2    2
3    3
4    5
dtype: int64

In [32]:
# Square all the numbers in a and reassign the result to a
a = a ** 2
a

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [33]:
# Square all the numbers in b and reassign the result to b
b = b ** 2
b

0     1
1     1
2     4
3     9
4    25
dtype: int64

In [38]:
# Create a series named sum_of_squares that holds the sum of the squared a and b
sum_of_squares = a + b
sum_of_squares

0     2
1     5
2    13
3    25
4    50
dtype: int64
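
Adding two series pairs up values by index label, not by position. With the default index the two are the same, so the sketch below uses made-up string labels in different orders to show the pairing.

```python
import pandas as pd

# Hypothetical series with the same labels in a different order
left = pd.Series([1, 2, 3], index=["x", "y", "z"])
right = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Values are matched by label: x pairs with x, y with y, z with z
total = left + right

print(total["x"], total["y"], total["z"])  # 31 22 13
```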

In [39]:
# Evaluate the square root of that sum_of_squares
sum_of_squares ** (1/2)

0    1.414214
1    2.236068
2    3.605551
3    5.000000
4    7.071068
dtype: float64

In [40]:
# Another approach for applying a math function to a series
# Import the square root function from the math library
# math.sqrt works on a single number, not on a whole series
from math import sqrt
sqrt(4)

2.0

For more on the pandas .apply method for series, see https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html

In [41]:
# Use the .apply method to call a function on each element of the series
sum_of_squares.apply(sqrt)

0    1.414214
1    2.236068
2    3.605551
3    5.000000
4    7.071068
dtype: float64
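
As an alternative to `.apply`, NumPy functions accept a whole series at once. The sketch below assumes NumPy is installed (it is a pandas dependency) and compares the two approaches. The names `sums`, `via_apply`, and `via_numpy` are illustrative.

```python
from math import sqrt

import numpy as np
import pandas as pd

# Same values as the lesson's sum_of_squares
sums = pd.Series([2, 5, 13, 25, 50])

# .apply calls math.sqrt once per element
via_apply = sums.apply(sqrt)

# np.sqrt takes the whole series and returns a series
via_numpy = np.sqrt(sums)

# Both approaches give the same values
print(np.allclose(via_apply, via_numpy))  # True
```

For large series, the NumPy version is usually the faster of the two because it avoids a Python-level function call per element.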