# [__Introduction to Python__](https://eventum.upf.edu/64915/section/30365/recsm-summer-methods-school-2021.html)

### [Tom Paskhalis](https://tom.paskhal.is/)

### [RECSM Summer School 2021](https://eventum.upf.edu/64915/detail/recsm-summer-methods-school-2021.html), Pandas and Data I/O, Part 3, Day 1

## Data in Python

- Python can hold and manipulate more than one dataset at the same time
- Python stores objects in memory
- The size of the data you can work with is limited by your computer's memory (RAM)
- Most functionality for dealing with data is provided by external libraries
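
As a rough illustration of the memory point above, the sketch below builds a small made-up table with the `pandas` library (introduced in the next section) and asks how many bytes it occupies. The `toy` dataset and its column names are hypothetical and serve only as an example.

```python
import pandas as pd

# Hypothetical toy dataset, used only to illustrate memory accounting
toy = pd.DataFrame({'id': range(1000), 'value': [0.5] * 1000})

# Bytes used by the index and by each column
# (deep=True also counts the contents of string/object columns)
bytes_per_column = toy.memory_usage(deep=True)
total_bytes = int(bytes_per_column.sum())

print(bytes_per_column)
print(total_bytes)
```

Comparing `total_bytes` with the available RAM gives a first idea of whether a dataset will fit in memory.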

## Pandas

- The Python standard library does not have a data type for tabular data
- However, the `pandas` library has become the de facto standard for data manipulation
- `pandas` is built upon (and often used in conjunction with) other computational libraries
- E.g. `numpy` (array data type), `scipy` (scientific computing, including linear algebra) and `scikit-learn` (machine learning)

In [1]:
import pandas as pd # Using 'as' defines a short alias, so we avoid typing the full name each time the module is invoked

## Series

- *Series* is a one-dimensional array-like object

In [2]:
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1

0     150.0
1     120.0
2    3000.0
dtype: float64

In [3]:
sr1[0] # Indexing is similar to standard Python objects

150.0

In [4]:
sr1[sr1 > 200]

2    3000.0
dtype: float64
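
The filtering step above works in two stages, which the following sketch separates. It reuses the same three values as `sr1`; the names `mask` and `heavy` are introduced here only for illustration.

```python
import pandas as pd

sr1 = pd.Series([150.0, 120.0, 3000.0])

# Stage 1: the comparison is applied element-wise and returns a boolean Series
mask = sr1 > 200
print(mask)

# Stage 2: indexing with that boolean Series keeps only the True positions
heavy = sr1[mask]
print(heavy)

# The surviving element keeps its original index label (2), it is not renumbered
print(heavy.index.tolist())
```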

## Indexing in Series

- Another way to think about Series is as an ordered dictionary

In [5]:
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}

In [6]:
sr2 = pd.Series(d)
sr2

apple          150.0
banana         120.0
watermelon    3000.0
dtype: float64

In [7]:
sr2.iloc[0] # Positional access, impossible for a standard dictionary (plain sr2[0] is deprecated for label-indexed Series in recent pandas)

150.0

In [8]:
sr2.index

Index(['apple', 'banana', 'watermelon'], dtype='object')
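
A short sketch of the different ways to reach elements of a label-indexed Series, assuming the same fruit weights as above. The variable names on the left-hand side are chosen here for illustration.

```python
import pandas as pd

sr2 = pd.Series({'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0})

by_key = sr2['banana']          # dictionary-style access by label
by_label = sr2.loc['banana']    # explicit label-based access
by_position = sr2.iloc[1]       # explicit position-based access (second element)

# Several labels at once return a smaller Series
subset = sr2[['apple', 'watermelon']]

# Membership tests look at the index, just like dictionary keys
has_kiwi = 'kiwi' in sr2

print(by_key, by_label, by_position)
print(subset)
print(has_kiwi)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity between labels and positions.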

## DataFrame - the workhorse of data analysis

- *DataFrame* is a rectangular table of data

In [9]:
data = {'fruit': ['apple', 'banana', 'watermelon'], # DataFrame can be constructed from
        'weight': [150.0, 120.0, 3000.0],           # a dict of equal-length lists/arrays
        'berry': [False, True, True]}           
df = pd.DataFrame(data)
df

|     | fruit      | weight | berry |
|:----|:-----------|:-------|:------|
| 0   | apple      | 150.0  | False |
| 1   | banana     | 120.0  | True  |
| 2   | watermelon | 3000.0 | True  |
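
After constructing a DataFrame it is often useful to check its dimensions and column types. The sketch below does this for the same fruit data and also shows an alternative construction from a list of row dictionaries; the names `rows` and `df_rows` are introduced here for illustration.

```python
import pandas as pd

data = {'fruit': ['apple', 'banana', 'watermelon'],
        'weight': [150.0, 120.0, 3000.0],
        'berry': [False, True, True]}
df = pd.DataFrame(data)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one data type per column

# The same table built row by row, from a list of dictionaries
rows = [{'fruit': 'apple', 'weight': 150.0, 'berry': False},
        {'fruit': 'banana', 'weight': 120.0, 'berry': True},
        {'fruit': 'watermelon', 'weight': 3000.0, 'berry': True}]
df_rows = pd.DataFrame(rows)
print(df_rows)
```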


## Indexing in DataFrame

- DataFrame has both row and column indices
- `DataFrame.loc[]` selects rows and columns by *label*
- `DataFrame.iloc[]` selects rows and columns by integer *position*

In [10]:
df.iloc[0] # First row

fruit     apple
weight    150.0
berry     False
Name: 0, dtype: object

In [11]:
df.iloc[:,0] # First column

0         apple
1        banana
2    watermelon
Name: fruit, dtype: object

## Summary of indexing in DataFrame

| Expression             | Selection Operation                                     |
|:-----------------------|:--------------------------------------------------------|
| `df[val]`              | Column or sequence of columns, plus convenience shortcuts (e.g. row slice) |
| `df.loc[lab_i]`        | Row or subset of rows by label                          |
| `df.loc[:, lab_j]`     | Column or subset of columns by label                    |
| `df.loc[lab_i, lab_j]` | Both rows and columns by label                          |
| `df.iloc[i]`           | Row or subset of rows by integer position               |
| `df.iloc[:, j]`        | Column or subset of columns by integer position         |
| `df.iloc[i, j]`        | Both rows and columns by integer position               |
| `df.at[lab_i, lab_j]`  | Single scalar value by row and column label             |
| `df.iat[i, j]`         | Single scalar value by row and column integer position  |
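
The expressions in the table can be tried out on the fruit data. This is a minimal sketch; the variable names and the `by_fruit` table (the same data with `fruit` as row labels) are introduced here to show where labels and positions differ.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

weights = df['weight']               # df[val]: one column, returned as a Series
row0 = df.loc[0]                     # df.loc[lab_i]: row whose index label is 0
banana_weight = df.loc[1, 'weight']  # df.loc[lab_i, lab_j]: one cell by labels
last_fruit = df.iloc[2, 0]           # df.iloc[i, j]: third row, first column
apple_is_berry = df.iat[0, 2]        # df.iat[i, j]: scalar by integer position

# With a non-default index, labels and positions no longer coincide
by_fruit = df.set_index('fruit')                     # 'fruit' becomes the row labels
label_lookup = by_fruit.loc['banana', 'weight']      # by row and column label
position_lookup = by_fruit.iloc[1, 0]                # same cell by position
scalar_lookup = by_fruit.at['watermelon', 'weight']  # df.at: scalar by labels

print(banana_weight, last_fruit, apple_is_berry)
print(label_lookup, position_lookup, scalar_lookup)
```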


## Subsetting in DataFrame

In [12]:
df.iloc[:2] # Select the first two rows by integer position

|     | fruit  | weight | berry |
|:----|:-------|:-------|:------|
| 0   | apple  | 150.0  | False |
| 1   | banana | 120.0  | True  |


In [13]:
df[:2] # Shortcut: a slice inside [] selects rows by position

|     | fruit  | weight | berry |
|:----|:-------|:-------|:------|
| 0   | apple  | 150.0  | False |
| 1   | banana | 120.0  | True  |


In [14]:
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'

|     | fruit      | berry |
|:----|:-----------|:------|
| 0   | apple      | False |
| 1   | banana     | True  |
| 2   | watermelon | True  |


In [15]:
df[['fruit', 'berry']] # Shortcut

|     | fruit      | berry |
|:----|:-----------|:------|
| 0   | apple      | False |
| 1   | banana     | True  |
| 2   | watermelon | True  |


## Columns in DataFrame

In [16]:
df.columns # Retrieve the names of all columns

Index(['fruit', 'weight', 'berry'], dtype='object')

In [17]:
df.columns[0] # This Index object is subsettable

'fruit'

In [18]:
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods

array([ True, False, False])

In [19]:
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria

|     | fruit      |
|:----|:-----------|
| 0   | apple      |
| 1   | banana     |
| 2   | watermelon |
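
Selecting columns by a name pattern can be written in several equivalent ways. The sketch below shows two alternatives to the `iloc` approach on the same fruit data: a boolean mask passed to `.loc`, and the `DataFrame.filter` method. The variable names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Boolean array with one entry per column
mask = df.columns.str.startswith('fr')

by_mask = df.loc[:, mask]            # .loc also accepts a boolean array over columns
by_regex = df.filter(regex='^fr')    # columns whose name matches a regular expression
by_substring = df.filter(like='e')   # columns whose name contains the substring 'e'

print(by_mask.columns.tolist())
print(by_regex.columns.tolist())
print(by_substring.columns.tolist())
```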


## Filtering in DataFrame

In [20]:
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries

|     | fruit | weight | berry |
|:----|:------|:-------|:------|
| 0   | apple | 150.0  | False |


In [21]:
df[df['berry'] == False] # The same can be achieved with more concise syntax

|     | fruit | weight | berry |
|:----|:------|:-------|:------|
| 0   | apple | 150.0  | False |


In [22]:
weight200 = df[df['weight'] > 200] # Create a new DataFrame with rows where weight is greater than 200
weight200

|     | fruit      | weight | berry |
|:----|:-----------|:-------|:------|
| 2   | watermelon | 3000.0 | True  |
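
Filters can also combine several conditions. A minimal sketch on the same fruit data, using the element-wise operators `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. The result names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Rows that are berries AND weigh more than 200
heavy_berries = df[(df['weight'] > 200) & df['berry']]

# Rows that are NOT berries (an alternative to comparing with False)
not_berries = df[~df['berry']]

# Rows that weigh less than 130 OR are not berries
light_or_not_berry = df[(df['weight'] < 130) | ~df['berry']]

print(heavy_berries)
print(not_berries)
print(light_or_not_berry)
```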


## Reading data in Python

- We will use the data from [Kaggle](https://www.kaggle.com) [2020 Machine Learning and Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2020/)
- For more information you can read the [executive summary](https://www.kaggle.com/kaggle-survey-2020)
- Or explore the [winning Python Jupyter Notebooks](https://www.kaggle.com/c/kaggle-survey-2020/discussion/212949)

In [23]:
# header=[0, 1] tells pandas to use the first two rows of the file as a two-level header
kaggle2020 = pd.read_csv('../data/kaggle_survey_2020_responses.csv', header = [0,1])

In [24]:
kaggle2020.head() # Returns the first n rows (n=5 by default)

|     | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q35_B_Part_2 | Q35_B_Part_3 | Q35_B_Part_4 | Q35_B_Part_5 | Q35_B_Part_6 | Q35_B_Part_7 | Q35_B_Part_8 | Q35_B_Part_9 | Q35_B_Part_10 | Q35_B_OTHER |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
|     | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Weights & Biases | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Trains | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other |
| 0   | 1838 | 35-39 | Man | Colombia | Doctoral degree | Student | 5-10 years | Python | R | SQL | ... |  |  |  | TensorBoard |  |  |  |  |  |  |
| 1   | 289287 | 30-34 | Man | United States of America | Master’s degree | Data Engineer | 5-10 years | Python | R | SQL | ... |  |  |  |  |  |  |  |  |  |  |
| 2   | 860 | 35-39 | Man | Argentina | Bachelor’s degree | Software Engineer | 10-20 years |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3   | 507 | 30-34 | Man | United States of America | Master’s degree | Data Scientist | 5-10 years | Python |  | SQL | ... |  |  |  |  |  |  |  |  |  |  |
| 4   | 78 | 30-34 | Man | Japan | Master’s degree | Software Engineer | 3-5 years | Python |  |  | ... |  |  |  |  |  |  |  |  |  |  |
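
To see what a two-row header does without downloading the survey file, the sketch below parses a tiny made-up CSV held in memory. The two columns and their shortened question texts are stand-ins invented for this example, not the actual Kaggle data.

```python
import io
import pandas as pd

# Hypothetical miniature survey: row 1 holds question codes, row 2 the question text
csv_text = (
    "Q1,Q2\n"
    "What is your age?,What is your gender?\n"
    "35-39,Man\n"
    "30-34,Man\n"
)

survey = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# The columns now form a two-level index: (code, question text)
print(survey.columns.nlevels)
print(survey.columns.tolist())

# Keeping only the first level gives short, convenient column names
flat = survey.copy()
flat.columns = survey.columns.get_level_values(0)
print(flat['Q1'].tolist())
```

Flattening the header this way is one option for making later column selection less verbose; the question text remains available in the original `survey` object.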


## Reading in other (non-`.csv`) data files

- Pandas can read in files other than `.csv` (comma-separated values)
- Common cases include Stata `.dta`, SPSS `.sav` and SAS `.sas7bdat`
- Use `pd.read_stata(path)`, `pd.read_spss(path)` (requires the `pyreadstat` package) and `pd.read_sas(path)`
- Check [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) for more examples
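
As a small illustration of a non-CSV format, the sketch below writes a table to a Stata `.dta` file in a temporary directory and reads it back. The file name `fruits.dta` and the two-column table are made up for this example.

```python
import os
import tempfile
import pandas as pd

fruits = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                       'weight': [150.0, 120.0, 3000.0]})

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fruits.dta')   # hypothetical file name
    fruits.to_stata(path, write_index=False)     # write in Stata format
    fruits_back = pd.read_stata(path)            # read it back into a DataFrame

print(fruits_back)
```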

## Writing data out in Python

- Note that when writing data out we call a method on the object that stores the dataset
- I.e. `df.to_csv(path)` as opposed to `df = pd.read_csv(path)`
- Pandas can also write out into other data formats
- E.g. `df.to_excel(path)`, `df.to_stata(path)`

In [25]:
kaggle2020.to_csv('../temp/kaggle2020.csv')
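
By default `to_csv` also writes the row index as an extra first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this on the small fruit table, writing to in-memory strings instead of files; whether to pass `index=False` depends on whether the index carries information worth keeping.

```python
import io
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Without a path, to_csv returns the CSV content as a string
csv_with_index = df.to_csv()                 # default: index is written
csv_without_index = df.to_csv(index=False)   # index is left out

with_index = pd.read_csv(io.StringIO(csv_with_index))
without_index = pd.read_csv(io.StringIO(csv_without_index))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```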

## Additional pandas materials

Books:

- McKinney, Wes. 2017. *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. 2nd ed. Sebastopol, CA: O'Reilly Media 
  
  **From the original author of the library!**

Online:

- [Pandas Getting Started Tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
- [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html) (intermediate and advanced)

## Tomorrow

- Exploratory data analysis
- Data visualization