![alt text](pandas.png "Title")

In [0]:
import pandas as pd

# Dataframes: accessing values

In [0]:
# Let's create a dataframe

patients = [10010, 10011, 10012]
data = {
    'gender': ['M', 'F', 'F'],
    'age':    [20, 25, 23],
}

# 'race' is not a key in data, so pandas creates that column filled with NaN
df = pd.DataFrame(data, index=patients, columns=['age', 'gender', 'race'])
df

In [0]:
# Looking at the raw data, we can access the values in the Python dict:
data['age']

In [0]:
# A dataframe behaves like a dictionary of pandas Series. Let's look at a column:
df['age']

# Notice that the Series keeps the dataframe index (the patient IDs) as well

In [0]:
# What type is that? 
type(df['age'])

In [0]:
# You can also use the dot syntax to display the Series:
df.age

# This syntax only works if the column name is a valid Python identifier (no spaces or special characters)
# and does not clash with an existing DataFrame attribute or method.
# For example, if the column is named 'My Var', you cannot use the dot syntax and must use df['My Var']
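
The two limits of the dot syntax can be illustrated with a small sketch. The frame `demo` and its column names are hypothetical and chosen only to trigger each case: `'My Var'` contains a space, and `'count'` has the same name as the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame, not part of the tutorial data
demo = pd.DataFrame({'My Var': [1, 2], 'count': [10, 20]})

my_var = demo['My Var']     # bracket syntax works for any column name
count_col = demo['count']   # bracket syntax returns the column as a Series
count_attr = demo.count     # dot syntax returns the DataFrame.count method, not the column

print(type(count_col).__name__)
print(callable(count_attr))
```

When in doubt, the bracket syntax is the safer choice because it never collides with attribute names.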

In [0]:
# Switch between pandas and core Python

# You can always convert a pandas object to a regular Python object:
ages = tuple(df['age'])

# Now you're free to iterate, manipulate, aggregate, do whatever, in pure Python.
# Sometimes it actually feels easier, but there's probably a simpler way in pandas...

# Let's accept, for now, our ignorance of pandas and calculate the average age using the tuple:
mean = sum(ages) / len(ages)

# Done with the "heavy lifting", let's come back to pandas and create a new column.
# Assigning a single value fills every row of the column with it:
df['mean'] = mean
df
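
For comparison, here is a minimal sketch of the same calculation done directly in pandas with `Series.mean()`. It rebuilds a reduced copy of the tutorial frame (only the `age` column) so that it can run on its own.

```python
import pandas as pd

# Reduced copy of the tutorial frame
df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

# Pure Python route, as in the cell above
ages = tuple(df['age'])
mean_python = sum(ages) / len(ages)

# pandas route: the Series computes its own mean
mean_pandas = df['age'].mean()

# Assigning a scalar to a new column repeats it on every row
df['mean'] = mean_pandas
print(df)
```

Both routes should give the same number; the pandas one avoids the round trip through a tuple.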

In [0]:
# pandas, like Python, is case sensitive!
# The column is named 'age', so asking for 'Age' raises an AttributeError:
print(df.Age)
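
A small sketch of what the failed lookup looks like with each syntax, and of a check that avoids the error. The frame is a reduced copy of the tutorial data; the variable names `attr_error`, `key_error`, `has_age` and `has_Age` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

attr_error = None
key_error = None

# Dot syntax with the wrong case raises AttributeError
try:
    df.Age
except AttributeError as err:
    attr_error = err

# Bracket syntax with the wrong case raises KeyError
try:
    df['Age']
except KeyError as err:
    key_error = err

# Checking membership first avoids both exceptions
has_age = 'age' in df.columns
has_Age = 'Age' in df.columns
print(has_age, has_Age)
```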

In [0]:
# Accessing a value inside the df: df > series > value
age_10010 = df['age'][10010]
print(age_10010)

# We retrieved a single value: an integer stored as a NumPy int64
print('Type=', type(age_10010))

## loc and iloc

These two dataframe indexers are useful for selecting, filtering and setting values.
* loc selects by label, that is, by index value or column name ([reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc))
* iloc selects by integer position ([reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html?highlight=iloc#pandas.DataFrame.iloc))
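
As a minimal sketch of the filtering and setting uses mentioned above, the block below works on a reduced copy of the tutorial frame. The replacement values (21 and 26) are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

# Filtering: loc accepts a boolean Series and keeps the rows where it is True
over_21 = df.loc[df['age'] > 21]

# Setting by label: row 10010, column 'age'
df.loc[10010, 'age'] = 21

# Setting by position: second row (position 1), first column (position 0, 'age')
df.iloc[1, 0] = 26

print(over_21)
print(df)
```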

In [0]:
df

In [0]:
# loc retrieves a given row of the df, returning a Series whose index holds the column names
# loc takes an index value (a label):
df.loc[10010]

# I look at it as kind of a pivot/transpose...

In [0]:
# iloc does the same but instead of the index value, it takes the index *position* in the df

# Retrieves the first df row
df.iloc[0] 

# Retrieves the last df row
df.iloc[-1] 

# and we can take bigger slices too, e.g. the first two rows (the end position is excluded):
df.iloc[0:2] 
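
Slices behave differently in the two indexers, which the following sketch tries to show on a reduced copy of the tutorial frame: `iloc` slices exclude the end position, as Python slices do, while `loc` slices include the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

# Positions 0 and 1; position 2 is excluded
by_position = df.iloc[0:2]

# Labels 10010 through 10011; both ends are included
by_label = df.loc[10010:10011]

# Both slices select the same two rows
same_rows = by_position.equals(by_label)
print(same_rows)
```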

In [0]:
# We can access single values using loc/iloc too:
df.loc[10010, 'age']

# Syntax: loc[row label, column label] or iloc[row position, column position]

In [0]:
# Different syntax, same outcome
df.loc[10010, 'age'] == df.loc[10010]['age'] == df['age'][10010] 
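
pandas also provides `at` and `iat`, which are label-based and position-based accessors for a single value. The sketch below, on a reduced copy of the tutorial frame, reads the same cell four ways; which spelling to prefer is a matter of taste for reads, although a single `loc` or `at` call avoids the intermediate Series created by chained indexing.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

value_loc = df.loc[10010, 'age']    # label-based, general purpose
value_at = df.at[10010, 'age']      # label-based, single value only
value_iat = df.iat[0, 0]            # position-based, single value only
value_chained = df['age'][10010]    # chained: column first, then row label

print(value_loc, value_at, value_iat, value_chained)
```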

In [0]:
# You can also easily subset a df with the head() method. I'm keeping the first 2 rows here:
df.head(2)

# Called without an argument, df.head() returns the first 5 rows

In [0]:
# tail() returns the last 5 rows (here all 3 rows, since df is shorter than that)
df.tail()
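
Both methods take the number of rows as an argument, and a negative number means "everything except that many rows". A short sketch on a reduced copy of the tutorial frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

first_two = df.head(2)       # first 2 rows
last_one = df.tail(1)        # last row only
all_but_last = df.head(-1)   # every row except the last one

print(all_but_last)
```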

__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+