<a href="https://colab.research.google.com/github/sakeefkarim/intro.python.24/blob/main/code/introduction_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to `Python` for Social Research

[Sakeef M. Karim](https://www.sakeefkarim.com/)

sakeef.karim@nyu.edu

## Preliminaries

This notebook is designed to provide an introduction to select libraries in `Python` that are *essential* for data science—inclusive of [`pandas`](https://pandas.pydata.org/) (for data wrangling) and [`seaborn`](https://seaborn.pydata.org/) (for data visualization). More concretely, it will offer some basic code for (i) manipulating tabular data frames; (ii) visualizing descriptive statistics; and (iii) estimating simple regression models. However, this notebook **will not** provide an exhaustive overview of the affordances of `Python` for research in the social and behavioural sciences, nor will it get into the weeds of [`scikit-learn`](https://scikit-learn.org/stable/) or other machine learning libraries that `Python` is known for (fear not: we'll delve into machine learning in a few short months).


## Loading Libraries

To kick things off, let's load our essential libraries.

In [None]:
# We use canonical naming conventions for our libraries and submodules:

import scipy as sp
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from statsmodels.formula.api import ols, logit

# An experimental submodule that brings the "grammar of graphics" into seaborn:

import seaborn.objects as so

### Note

If you're using [Google Colab](https://colab.google/), you can "mount" your Google Drive folders onto a Colab session to save plots, data sets and so on. To programmatically mount your Drive folder(s), run the following lines:

In [None]:
from google.colab import drive
drive.mount('/drive')

# Loading Data

To read in different _kinds_ of input data, we will use functions from the `pandas` library (e.g., `pd.read_csv`). For today's session, we will largely work with a dataset that we already encountered during the [<font face="Inconsolata" size=4.5> PopAgingDataViz</font>](https://popagingdataviz.com/) workshop: [gapminder](https://jennybc.github.io/gapminder/).






 ### Note

 If you're using Jupyter _locally_, you will want to:

 1. Clone/download our companion GitHub repository, [`intro.python.24`](https://github.com/sakeefkarim/intro.python.24).

 2. _Change your working directory_ (via the command line) so that it points to the `intro.python.24` folder you just cloned/downloaded.


 To implement Step 2, feel free to uncomment the code snippet below and run the cell, but please **remember to adjust the code as needed** (i.e., to reflect where `intro.python.24` is located on your machine).

In [None]:
# cd "THE PATH TO ... /intro.python.24"
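
If you prefer to stay within `Python`, the same step can be sketched with the standard library's `os` module. This is only an illustration: `project_dir` is a hypothetical stand-in (a temporary folder) that you would replace with the actual location of `intro.python.24` on your machine.

```python
import os
import tempfile

# Where is Python currently "looking" for files?
start_dir = os.getcwd()

# Hypothetical stand-in for the folder you cloned/downloaded;
# replace this with the real path to intro.python.24:
project_dir = tempfile.mkdtemp()

# Change the working directory and confirm the change:
os.chdir(project_dir)
current_dir = os.getcwd()
print(current_dir)

# Return to where we started so the rest of the session is unaffected:
os.chdir(start_dir)
```

Once the working directory points to the repository folder, relative paths such as `data/gapminder.csv` should resolve as expected.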

### Excel and CSV Files

In [None]:
# From the companion GitHub repository:

gapminder_excel = pd.read_excel("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.xlsx")

gapminder_csv = pd.read_csv("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.csv")

gapminder_excel

# If you have the intro.python.24 folder on your machine:

# gapminder_excel = pd.read_excel("data/gapminder.xlsx")

# gapminder_csv = pd.read_csv("data/gapminder.csv")
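
To get a feel for what `pd.read_csv` returns, here is a minimal sketch that parses a few gapminder-style rows from an in-memory string instead of a URL or file path. The rows (and the names `raw_csv` and `toy`) are made up for illustration; they are not taken from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical rows that mimic gapminder's column layout (illustrative values only):
raw_csv = """country,continent,year,lifeExp,pop,gdpPercap
Atlantis,Europe,2002,70.5,1000000,15000.0
Atlantis,Europe,2007,71.0,1100000,16000.0
Wakanda,Africa,2007,75.2,6000000,30000.0
"""

# read_csv accepts file paths, URLs and file-like objects such as StringIO:
toy = pd.read_csv(io.StringIO(raw_csv))

# The result is a data frame with one column per CSV field:
print(toy.shape)
print(toy.dtypes)
```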


### Stata Files

In [None]:
# From the companion GitHub repository:

gapminder_dta = pd.read_stata("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.dta")

# If you have the intro.python.24 folder on your machine:

# gapminder_dta = pd.read_stata("data/gapminder.dta")


### SPSS Files

In [None]:
# If you want to read in SPSS files, you may want to launch this notebook locally.

# Then, make sure you have pyreadstat installed within your conda environment — say, by running

# conda install pyreadstat

# in your terminal. You can then run the code below:

# gapminder_spss = pd.read_spss("data/gapminder.sav")


### R Files

In [None]:
# In Google Colab:

!pip install pyreadr

# Locally, install pyreadr within your conda environment, e.g.:

# conda install pyreadr

In [None]:
# Here, the alias is idiosyncratic (no canonical conventions for pyreadr)

import pyreadr as pyr

# Loading R files from GitHub

rds_url = "https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.rds"

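# This path assumes that Google Drive is mounted at '/drive' (see the note above)
# and that a folder named "Colab" already exists in your Drive:

destination = '/drive/My Drive/Colab/gapminder.rds'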

pyr.download_file(rds_url, destination)

gapminder_r = pyr.read_r(destination)

# Checking to see which objects are available:

print(gapminder_r.keys())

# The only key is None (an .rds file stores a single, unnamed object), ergo:

gapminder_rds = gapminder_r[None]

# Importing R files locally:

# gapminder_r = pyr.read_r("data/gapminder.rds")

# gapminder_rds = gapminder_r[None]

# Exploring the Data

### Basic Code

In [None]:
# For simplicity's sake, let's call our data frame of interest "gapminder"

gapminder = gapminder_excel

# Now, we'll use the "head" method to preview the first five rows:

gapminder.head()

# In IPython/Jupyter, a leading "?" displays the documentation for a method:

?gapminder.head

How can we look at the first **ten** observations in `gapminder`?
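
One possible answer, sketched on a stand-in data frame: `head` (like `tail`) accepts the number of rows to return as its first argument, which defaults to 5. The `toy` data frame below is hypothetical and simply holds 25 numbered rows.

```python
import pandas as pd

# Hypothetical stand-in data frame with 25 rows:
toy = pd.DataFrame({"row_id": range(25)})

# First ten observations:
first_ten = toy.head(10)

# Last three observations:
last_three = toy.tail(3)

print(first_ten)
print(last_three)
```

With the real data, the equivalent call would be `gapminder.head(10)`.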

In [None]:
# What if we want to take a peek at the last few observations in our dataset?

gapminder.tail()

In [None]:
# Dimensions of our data frame:

gapminder.shape

# List the columns/variables we have at our disposal:

gapminder.columns

# What *type* of variable is "continent"?

print(gapminder['continent'].dtype)

# Let's look at all the "dtypes":

gapminder.dtypes

### Exploratory Visualizations

Here's a very useful way to explore your data via `seaborn`. We'll be returning to `seaborn` at some point this semester.

In [None]:
# This unlocks seaborn's basic 'dark grid' theme:

sns.set_theme()

# Other seaborn themes: http://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles

sns.pairplot(gapminder)

# The 'hue' parameter (for most seaborn functions) allows analysts to condition on a
# variable of interest:

sns.pairplot(gapminder, hue = 'continent')

# Descriptive Statistics

In [None]:
# Basic Descriptives

gapminder.describe()

# Include non-numeric variables:

gapminder.describe(include='all')

# ONLY include non-numeric variables:

gapminder.describe(exclude=[np.number])

# OR:

gapminder.describe(include=['object'])

# Calculate frequency of different continents in data set:

gapminder['continent'].value_counts(normalize=True)
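
As a small illustration of what `normalize=True` does, the sketch below compares raw counts with proportions. The `continents` series is a made-up example rather than the gapminder column.

```python
import pandas as pd

# Hypothetical categorical variable:
continents = pd.Series(["Africa", "Africa", "Asia", "Europe"], name="continent")

# Raw frequencies:
counts = continents.value_counts()

# Relative frequencies (proportions that sum to 1):
shares = continents.value_counts(normalize=True)

print(counts)
print(shares)
```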

In [None]:
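# Mean of the variables in the first 50 rows. In pandas 2.0+, calling mean() on a
# data frame that contains string columns raises a TypeError, so the line below is
# commented out:

# gapminder.head(50).mean()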

# Zeroing-in on numeric variables:

gapminder.head(50).mean(numeric_only = True)

gapminder.select_dtypes(exclude=['object']).head(50).mean()

gapminder.select_dtypes(include=[np.number]).head(50).mean()

# We can also store the subsetted data frame as an object before applying the mean method:

gapminder_50 = gapminder.select_dtypes(exclude=['object']).head(50)

gapminder_50.mean(numeric_only = True)

# Are these statistics informative at all? Hint: they're not! But why?

# We'll produce more informative descriptives below — after we've touched on
# filtering and reshaping.

In [None]:
# Group data by continent:

gapminder_continent = gapminder.groupby(['continent'])

gapminder_continent.mean(numeric_only = True)

# Grouped averages for life expectancy:

gapminder_continent[['lifeExp']].mean(numeric_only=True)

# Grouped averages for life expectancy (descending):

gapminder_continent[['lifeExp']].mean().sort_values(by = 'lifeExp', ascending=False)
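
To make the split-apply-combine logic behind `groupby` easier to follow, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical). It groups by continent, averages life expectancy within each group and sorts the result in descending order.

```python
import pandas as pd

# Hypothetical data: five "countries" nested in three continents
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Europe"],
    "lifeExp": [50.0, 60.0, 70.0, 72.0, 80.0],
})

grouped_means = (
    toy.groupby("continent")[["lifeExp"]]        # split by continent, keep one column
    .mean()                                      # apply: average within each group
    .sort_values(by="lifeExp", ascending=False)  # sort the combined result
)

print(grouped_means)
```

The grouping variable becomes the index of the resulting data frame, which is why rows can be pulled out by continent name (e.g., `grouped_means.loc["Asia"]`).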

# Filtering Observations, Column Selection

In [None]:
# Select column — life expectancy:

gapminder['lifeExp']

gapminder.lifeExp

gapminder[['lifeExp']]

# Selecting three columns: continent, country, life expectancy

gapminder[['continent', 'country', 'lifeExp']]

# What *kind* of variables are we dealing with?

gapminder[['continent', 'country', 'lifeExp']].dtypes
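
The difference between single and double brackets is easy to miss, so here is a short sketch on a hypothetical two-row data frame (`toy`): single brackets and dot notation return a one-dimensional `Series`, whereas double brackets return a `DataFrame`.

```python
import pandas as pd

# Hypothetical two-row data frame:
toy = pd.DataFrame({"country": ["A", "B"], "lifeExp": [70.0, 80.0]})

# Single brackets (or dot notation) return a Series:
as_series = toy["lifeExp"]

# Double brackets return a data frame, even for a single column:
as_frame = toy[["lifeExp"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```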

In [None]:
gapminder_select = gapminder[ gapminder['lifeExp'] >= 80 ]

# Dimensions (original data frame)

print(gapminder.shape)

# Dimensions (subsetted data frame)

print(gapminder_select.shape)

In [None]:
gapminder_africa = gapminder[ gapminder['continent'] == 'Africa']

gapminder_africa.head()

gapminder_africa.reset_index(drop=True).head()
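
Why bother with `reset_index`? The sketch below, which uses a hypothetical four-row data frame (`toy`), shows that a filtered data frame keeps the index labels of the original rows until the index is reset.

```python
import pandas as pd

# Hypothetical data frame with the default 0, 1, 2, 3 index:
toy = pd.DataFrame({
    "continent": ["Asia", "Africa", "Europe", "Africa"],
    "lifeExp": [70.0, 55.0, 80.0, 60.0],
})

# Boolean filtering keeps the original index labels of the matching rows:
africa = toy[toy["continent"] == "Africa"]
print(africa.index.tolist())

# drop=True discards the old labels instead of storing them in a new column:
africa_reset = africa.reset_index(drop=True)
print(africa_reset.index.tolist())
```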

In [None]:
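# Subsetting rows based on *index label* (the end label is included, so this returns 11 rows):

gapminder.loc[0:10]

# Subsetting rows based on *integer position* (the end position is excluded, so this returns 10 rows):

gapminder.iloc[0:10]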
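
The two slices above look identical but behave differently. As a minimal sketch on a hypothetical 20-row data frame (`toy`), the code below shows that label-based slices include the end label, position-based slices exclude the end position, and the two diverge further once the index no longer matches row positions.

```python
import pandas as pd

# Hypothetical data frame with the default 0..19 index:
toy = pd.DataFrame({"value": range(100, 120)})

by_label = toy.loc[0:10]      # labels 0 through 10, end label included
by_position = toy.iloc[0:10]  # positions 0 through 9, end position excluded

print(len(by_label), len(by_position))

# After filtering, labels and positions no longer line up:
evens = toy[toy["value"] % 2 == 0]
first_by_position = evens.iloc[1]["value"]  # second row of the subset
first_by_label = evens.loc[2]["value"]      # row whose index label is 2
```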

In [None]:
# First row, first column — pursuant to row/column position

gapminder.iloc[0, 0]

# Subset data frame based on labels:

gapminder.loc[0, 'country']

# Produce data frame based on row/column labels:

gapminder.loc[[0], ['country']]

# Produce data frame based on row/column positions:

gapminder.iloc[[0], [0]]

In [None]:
# Last row:

gapminder.iloc[-1]

# Last column:

gapminder.iloc[:, [-1]]

# First and sixth rows + second and fourth columns:

gapminder.iloc[[0, 5], [1, 3]]

# Cleaning (Recoding, New Variables *etc.*)

In [None]:
# Let's zero-in on the latest year in the gapminder data set:

gapminder_07 = gapminder.query("year == 2007 and continent != 'Oceania'").reset_index(drop=True)

# The year variable's no longer necessary! Let's drop it:

gapminder_07.drop(columns='year',
                  # inplace=True modifies gapminder_07 directly: the method returns None,
                  # so no output is displayed and the result does not have to be
                  # stored in a new object:
                  inplace = True)

# Let's rename some columns

gapminder_07.rename(columns={# Existing column name : new column name
                   'gdpPercap':'gdp_pc',
                   'lifeExp':'le'}, inplace=True)

# For illustrative purposes, let's generate new variables that take the log of
# per capita GDP/population:

gapminder_07['ln_gdp'] = np.log(gapminder_07['gdp_pc'])

gapminder_07['ln_pop'] = np.log(gapminder_07['pop'])

# Discretizing population size and per capita GDP into new quintile measures:

gapminder_07['pop_quintile'] = pd.qcut(gapminder_07['pop'], q = 5, labels=False) + 1

gapminder_07['gdp_quintile'] = pd.qcut(gapminder_07['gdp_pc'], q = 5, labels=False) + 1

# Using quintile measures to create crude binary indicators

gapminder_07['size'] = np.where(gapminder_07['pop_quintile'] > 4, 'Large', 'Small or Medium')

# gapminder_07['size'] = gapminder_07['size'].map({'Large': 1, 'Small or Medium': 0})

gapminder_07['wealth'] = np.where(gapminder_07['gdp_quintile'] > 3, 'Top 2 Quintiles', 'Bottom 3 Quintiles')

# Transforming quintile measures into categorical variables

gapminder_07[['pop_quintile', 'gdp_quintile']] = gapminder_07[['pop_quintile', 'gdp_quintile']].astype('category')
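
To see what `pd.qcut` and `np.where` are doing in the cell above, here is a minimal sketch on ten made-up population values (the `toy` data frame is hypothetical). With `labels=False`, `qcut` returns integer bin codes starting at 0, which is why 1 is added to obtain quintiles numbered 1 to 5.

```python
import numpy as np
import pandas as pd

# Hypothetical population values:
toy = pd.DataFrame({"pop": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Five equal-sized bins; codes run from 0 to 4, so add 1:
toy["pop_quintile"] = pd.qcut(toy["pop"], q=5, labels=False) + 1

# np.where(condition, value_if_true, value_if_false) works element-wise:
toy["size"] = np.where(toy["pop_quintile"] > 4, "Large", "Small or Medium")

print(toy)
```

Only observations in the top quintile are labelled "Large" here, mirroring the `> 4` threshold used for `gapminder_07`.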


# Fitting Crude Models

These models are, of course, for illustrative purposes—to wit, don't read into the results!

In [None]:
mod_gapminder_1 = ols("le ~ ln_pop + ln_gdp + continent", data=gapminder_07).fit()

# Text

mod_gapminder_1.summary()

# LaTeX Code

print(mod_gapminder_1.summary().as_latex())

# CSV

mod_gapminder_1.summary().as_csv()

# Zooming in on parameters

print(mod_gapminder_1.summary().tables[1])

# Save results as an Excel file

mod1_results = mod_gapminder_1.summary().tables[1]

pd.DataFrame(mod1_results).to_excel('/drive/My Drive/Colab/mod1_results.xlsx')
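
Another way to export results, sketched under some assumptions, is to build the coefficient table directly from attributes of the fitted model (`params`, `bse`, `tvalues`, `pvalues` and `conf_int()`) rather than from the printed summary. The simulated data, the names `toy`, `fit` and `coef_table`, and the column labels are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated data (illustrative only): y depends linearly on x plus noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=100)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(scale=0.5, size=100)

fit = ols("y ~ x", data=toy).fit()

# Assemble a tidy coefficient table from the fitted model's attributes:
coef_table = pd.DataFrame({
    "coef": fit.params,
    "std_err": fit.bse,
    "t": fit.tvalues,
    "p_value": fit.pvalues,
})

# conf_int() returns the lower and upper bounds of the 95% confidence intervals:
conf_int = fit.conf_int()
conf_int.columns = ["ci_lower", "ci_upper"]
coef_table = coef_table.join(conf_int)

print(coef_table)

# Because coef_table is an ordinary data frame, it can be written to disk, e.g.:
# coef_table.to_excel("mod_results.xlsx")
```

Since the table holds plain numbers, it can also be rounded, filtered or merged with results from other models before it is exported.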


In [None]:
mod_gapminder_2 = ols("le ~ ln_pop + ln_gdp * continent", data=gapminder_07).fit()

# Zooming in on parameters

print(mod_gapminder_2.summary().tables[1])

In [None]:
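# Recode "size" as a 0/1 indicator for the logit model. Run this line only once:
# re-running it maps the already recoded values to NaN.

gapminder_07['size'] = gapminder_07['size'].map({'Large': 1, 'Small or Medium': 0})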


mod_gapminder_3 = logit("size ~ ln_gdp + le", data=gapminder_07).fit()


print(mod_gapminder_3.summary().tables[1])

# Exercises

### Work With the `penguins` Data Frame


Now, it's your turn to wrangle (or "pre-process") data in `Python`. To do so, you'll be working with *another* dataset you encountered during the [<font face="Inconsolata" size=4.5> PopAgingDataViz</font>](https://popagingdataviz.com/) workshop — i.e., the `penguins` data frame from the wonderful [`{palmerpenguins}`](https://allisonhorst.github.io/palmerpenguins/) package.

You can find different versions of `penguins` via [`intro.python.24/data`](https://github.com/sakeefkarim/intro.python.24/tree/main/data). For those of you who are new to `Python`, feel free to download some helpful cheat sheets from [`intro.python.24/code/cheat sheets`](https://github.com/sakeefkarim/intro.python.24/tree/main/code/cheat%20sheets).



### Tasks

1. Load `penguins` into your `Jupyter Notebook`.

2. What are the *dimensions* of the `penguins` data frame (i.e., the number of rows and columns)?

3. List all the columns in `penguins` and what *kinds* of variables they are.

4. Show the first 10 rows in the dataset.

5. Show the last 20 rows in the dataset.

6. Visualize the relationships between all numeric variables in `penguins` — but make sure these visual summaries vary by the `species` variable.

7. Generate basic descriptive statistics for all numeric variables.

8. Generate basic descriptive statistics for all discrete variables.

9. Isolate the first 35 observations in `penguins` and display the mean values associated with *all numeric variables* (for these 35 observations).

10. Show the mean value for `bill_length_mm` at different levels of `species` (more concretely, produce grouped averages for different penguins species). Sort the rows by descending (mean) values of `bill_length_mm`.

11. Create a new data frame that:

  + Only includes data from the latest `year` in `penguins`.
  + Resets the index value for each observation.
  + Removes the `year` variable.

12. Show the rows corresponding to index values of `0` and `21` in your new data frame.

13. Show the 10th and 20th observations in your new data frame.

14. Rename at least two of the variables in your new data frame; store the results.

15. Using your custom data frame, generate new numeric and discrete variables; once again, store the results.

16. Estimate at least two regression models using your custom data frame.

17. Export model results as `.xlsx` files.

18. Try to reproduce your results (for questions 1-17) using [`reticulate`](https://rstudio.github.io/reticulate/) in `R`.