# Pandas: Getting Started

## Pandas Series

A pandas Series stores and manipulates a one-dimensional array of values. It is one of the basic data structures in pandas and behaves like a mix of a list and a dictionary: the values are kept in order, and each value has an index label that can be used to look it up.

A Series is built on top of a NumPy array, but it exposes both the values and their index labels explicitly instead of keeping positions internal.

In [32]:
import pandas as pd
import numpy as np

In [33]:
# Pandas stores our list in a Series by creating a new NumPy array and storing our values in it
# This NumPy approach gives us several advantages, such as more efficiency when manipulating arrays
students = ['Wiliam', 'Molly', 'Alex']

pd.Series(students)

0    Wiliam
1     Molly
2      Alex
dtype: object

In [34]:
numbers = [1, 2, 3]

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
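
To make the point about values and indices concrete, here is a minimal sketch that inspects both parts of a Series. The variable names are illustrative only and do not come from the notebook.

```python
import pandas as pd

numbers_series = pd.Series([1, 2, 3])

# The index holds the labels; by default it is a range starting at 0
index_labels = list(numbers_series.index)

# to_numpy() returns the stored values as a NumPy array
values_array = numbers_series.to_numpy()

print(index_labels)
print(values_array)
```

Keeping the labels separate from the values is what lets a Series behave like a list and a dictionary at the same time.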

## None Values

When creating a Series, pandas inspects the elements to decide which dtype to use. If it finds a None value among non-numeric elements, it stores the None as it is in the NumPy array, and the dtype remains that of the underlying array (the one it would have without the None).

In [35]:
testing = ['William', None, 'Bruno']

pd.Series(testing)
# try it: pd.Series(testing).dropna() -> 'drop all None values'

0    William
1       None
2      Bruno
dtype: object

In [36]:
# When receiving an array of integers or floats to create a Series, pandas checks whether the array has None objects.
# If it has, pandas converts them to NaN (Not a Number) and converts the rest of the numbers, whether they are
# integers or not, to floating point
numbers = [1, 2, None]

pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

## Why None Values?

For data scientists, None or NaN values are used to represent missing data. It's important to know how to clean these missing values from the data set if they do not add any knowledge.
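
As a sketch of what cleaning can look like, the Series methods ``isna()``, ``dropna()`` and ``fillna()`` cover the common cases. The data and the replacement value 0 are assumptions chosen for illustration; the right strategy depends on the data set.

```python
import pandas as pd

grades = pd.Series([1, 2, None])

# isna() flags each missing entry with True
missing_mask = grades.isna()

# dropna() returns a new Series without the missing entries
cleaned = grades.dropna()

# fillna() returns a new Series with the missing entries replaced
filled = grades.fillna(0)

print(missing_mask.tolist())
print(cleaned.tolist())
print(filled.tolist())
```

None of these calls modifies ``grades``; each one returns a new Series.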

## How to know if there are NaN values?

Ordinary comparisons do not behave as we might expect with NaN values: NaN is not equal to anything, not even itself. Because of that, it's important to use the ``isnan()`` function from the numpy library, which checks whether an object is a NaN value.

In [37]:
np.isnan(np.nan)

True
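
The following sketch shows why a plain comparison is not enough, and adds ``pd.isna()`` as an alternative that also accepts None. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
equal_to_itself = np.nan == np.nan

# np.isnan() works for floats, while pd.isna() also accepts None
nan_check = np.isnan(np.nan)
none_check = pd.isna(None)

# On a Series, isna() checks every element at once
mask = pd.Series([1, 2, None]).isna()

print(equal_to_itself, nan_check, none_check)
print(mask.tolist())
```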

## Creating Series from Dictionaries

We've seen that when we pass a list as the argument to create a pandas series, each element is stored with a key that is an integer growing from 0 to len(list) - 1.

But we can create a series from a dictionary as well. In that case, the keys will no longer be integers; the series uses the keys from the dict instead.

In [38]:
students_grades = {'Molly': 'Physics',
                   'William': 'Math',
                   'Andre': 'Chemistry',
                  }
s = pd.Series(students_grades)
s

Molly        Physics
William         Math
Andre      Chemistry
dtype: object

In [39]:
# The dtype: object is not just for strings, it is for any Python object type
students_last_names = [('William', 'Sales'), ('Marcos', 'Vinicius'), ('João', 'Lucas')]

pd.Series(students_last_names)

0      (William, Sales)
1    (Marcos, Vinicius)
2         (João, Lucas)
dtype: object

## Creating an arbitrary Series

We can also pass an ``index`` argument to the Series constructor to choose which keys we want to pick from the dict. But what happens if the index argument contains a key that is not in our dict? Pandas will create an entry in the Series with that key and assign it a NaN value. If we leave some key of the dict out of the index, pandas will simply not add it to the Series.

In [40]:
# Excluding the 'Andre' key
pd.Series({'Molly': 'Physics', 'William': 'Math', 'Andre': 'Chemistry'}, index=['Molly', 'William'])

Molly      Physics
William       Math
dtype: object

In [41]:
# The index argument has a non-existing key, 'Sean'
pd.Series({'Molly': 'Physics', 'William': 'Math', 'Andre': 'Chemistry'}, index=['Molly', 'William', 'Sean'])

Molly      Physics
William       Math
Sean           NaN
dtype: object

## Querying Series

There are many ways to query a pandas series. One of them is accessing a value by its index label, much like looking up a key in a dictionary. Values can also be accessed by position, as with list indices.

In [42]:
s

Molly        Physics
William         Math
Andre      Chemistry
dtype: object

In [43]:
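print(s['Molly'])
print(s.iloc[0])   # by position; plain s[0] on a labeled index is deprecated in recent pandas
print(s['Andre'])
print(s.iloc[-1])  # last element by position
# print(s['Mollya']) -> KeyError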

Physics
Physics
Chemistry
Chemistry
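
A minimal sketch of the explicit accessors, assuming a Series with the same content as ``s``: ``loc`` always looks up by label and ``iloc`` always looks up by position, which avoids any ambiguity. The name ``courses`` is illustrative.

```python
import pandas as pd

courses = pd.Series({'Molly': 'Physics', 'William': 'Math', 'Andre': 'Chemistry'})

# loc looks up by index label
by_label = courses.loc['Molly']

# iloc looks up by integer position, like a list
by_position = courses.iloc[0]
last_by_position = courses.iloc[-1]

# get() returns None for a missing label instead of raising KeyError
missing = courses.get('Mollya')

print(by_label, by_position, last_by_position, missing)
```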


## The Power Of Vectorization

A pandas Series is built on a NumPy array, which means that we can apply NumPy operations to our series too. Why not just write a loop and iterate over each value of the series? We've discussed this before: NumPy operations are vectorized, which means that the loop over the whole array runs in optimized compiled code (which can use parallel CPU instructions) instead of one Python-level step per element, and this speeds up the algorithm considerably.

In [44]:
%%timeit -n 100
numbers = np.random.randint(0, 1000, 10000)
ser = pd.Series(numbers)
total = 0

for number in ser:
    total += number

total/len(numbers)

3.98 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [45]:
%%timeit -n 100
numbers = np.random.randint(0, 1000, 10000)
ser = pd.Series(numbers)

total = np.sum(ser)
total/len(numbers)

534 µs ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
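
Outside of IPython, where the ``%%timeit`` magic is not available, the two approaches can still be compared for correctness. This sketch uses a small fixed array (an assumption, so the result is reproducible) and shows that the loop and the vectorized call give the same mean.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(10))

# Explicit Python loop: one iteration per element
loop_total = 0
for number in ser:
    loop_total += number
loop_mean = loop_total / len(ser)

# Vectorized: the whole computation happens inside pandas/NumPy
vectorized_mean = ser.mean()

# Arithmetic is also applied element-wise without a loop
shifted = ser + 2

print(loop_mean, vectorized_mean)
print(shifted.tolist())
```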


## Merging Two Series Objects

If we have two series that we want to merge, we can use the ``pd.concat()`` function. (The older ``Series.append()`` method was deprecated in pandas 1.4 and removed in pandas 2.0.) It returns a new series with one concatenated after the other instead of modifying either of them.

In [46]:
student_kelly = pd.Series(['English', 'Physics', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
new_series = pd.concat([s, student_kelly])

new_series

Molly        Physics
William         Math
Andre      Chemistry
Kelly        English
Kelly        Physics
Kelly           Math
dtype: object
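
Index labels do not have to be unique, as the repeated 'Kelly' label shows. The sketch below, which assumes a pandas version where concatenation is done with ``pd.concat()``, checks what a lookup on a repeated label returns and confirms that the inputs are left unchanged. The names ``combined`` and ``kelly_courses`` are illustrative.

```python
import pandas as pd

s = pd.Series({'Molly': 'Physics', 'William': 'Math', 'Andre': 'Chemistry'})
student_kelly = pd.Series(['English', 'Physics', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])

# concat returns a new Series; s and student_kelly are not modified
combined = pd.concat([s, student_kelly])

# A repeated label returns every matching entry as a Series
kelly_courses = combined.loc['Kelly']

print(len(s), len(combined))
print(kelly_courses.tolist())
```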

# Pandas Data Frame Objects

Data frames are the heart of the pandas library. We can think of a data frame as a two-dimensional array in which each row has an index label and each column has a name.

In [47]:
import pandas as pd

In [48]:
record_1 = {'Name': 'William', 'Course': 'Physics', 'Grade': 85}
record_2 = {'Name': 'João', 'Course': 'Math', 'Grade': 95}
record_3 = {'Name': 'Marcos', 'Course': 'Chemistry', 'Grade': 75}

# record_1 = ['João', 'Almdeida'] -> try it

In [49]:
df = pd.DataFrame(data=[record_1, record_2, record_3], index=['School 1', 'School 2', 'School 3'])
df

             Name     Course  Grade
School 1  William    Physics     85
School 2     João       Math     95
School 3   Marcos  Chemistry     75


In [50]:
df.loc['School 2']

Name      João
Course    Math
Grade       95
Name: School 2, dtype: object

In [51]:
type(df.loc['School 2'])

pandas.core.series.Series

In [52]:
# We can combine a column lookup and a row lookup by querying df[column][row]
df['Name']['School 1']

'William'
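
As a sketch of an alternative to chaining two lookups, ``loc`` accepts a row label and a column name together. The data frame is rebuilt here so the block is self-contained, and the names ``name``, ``subset`` and ``second_row`` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        {'Name': 'William', 'Course': 'Physics', 'Grade': 85},
        {'Name': 'João', 'Course': 'Math', 'Grade': 95},
        {'Name': 'Marcos', 'Course': 'Chemistry', 'Grade': 75},
    ],
    index=['School 1', 'School 2', 'School 3'],
)

# One call with (row label, column name) returns a single value
name = df.loc['School 1', 'Name']

# A list of column names selects several columns for all rows
subset = df.loc[:, ['Name', 'Grade']]

# iloc selects a row by position instead of by label
second_row = df.iloc[1]

print(name)
print(subset)
print(second_row['Course'])
```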

## Dropping values

Pandas allows us to drop rows from data frames with the ``drop(row_index)`` function. It drops a row of the df according to the row_index argument. It does not affect our original df; instead, it returns a copy with the row dropped.

In [53]:
df.drop('School 1')

            Name     Course  Grade
School 2    João       Math     95
School 3  Marcos  Chemistry     75


In [54]:
df

             Name     Course  Grade
School 1  William    Physics     85
School 2     João       Math     95
School 3   Marcos  Chemistry     75


In the ``drop()`` function we can also set the ``inplace`` argument to True, which does not create a copy of the df but modifies the original itself. Another important argument is ``axis``, which can be 0 to drop a row or 1 to drop a column.

In [55]:
df.drop?

In [56]:
df.drop('School 3', inplace=True)
df

             Name   Course  Grade
School 1  William  Physics     85
School 2     João     Math     95


In [59]:
df.drop('Course', inplace=True)

KeyError: "['Course'] not found in axis"
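
The KeyError above is raised because ``drop()`` looks for the label among the row labels by default (``axis=0``). A minimal sketch of dropping a column instead, using either ``axis=1`` or the ``columns`` keyword; the data frame is rebuilt here so the block is self-contained, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        {'Name': 'William', 'Course': 'Physics', 'Grade': 85},
        {'Name': 'João', 'Course': 'Math', 'Grade': 95},
    ],
    index=['School 1', 'School 2'],
)

# axis=1 tells drop() to look for the label among the columns
without_course = df.drop('Course', axis=1)

# The columns keyword is an equivalent, more explicit spelling
same_result = df.drop(columns='Course')

print(list(without_course.columns))
print(list(df.columns))  # the original still has all three columns
```

Because ``inplace`` is left at its default here, ``df`` itself keeps the 'Course' column.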