<a href="https://colab.research.google.com/github/sakeefkarim/intro.python.24/blob/main/code/introduction_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to `Python` for Social Research

[Sakeef M. Karim](https://www.sakeefkarim.com/)

sakeef.karim@nyu.edu

## Preliminaries

This notebook is designed to provide an introduction to select libraries in `Python` that are *essential* for data science—inclusive of [`pandas`](https://pandas.pydata.org/) (for data wrangling) and [`seaborn`](https://seaborn.pydata.org/) (for data visualization). More concretely, it will offer some basic code for (i) manipulating tabular data frames; (ii) visualizing descriptive statistics; and (iii) estimating simple regression models. However, this notebook **will not** provide an exhaustive overview of the affordances of `Python` for research in the social and behavioural sciences, nor will it get into the weeds of [`scikit-learn`](https://scikit-learn.org/stable/) or other machine learning libraries that `Python` is known for (fear not: we'll delve into machine learning in a few short months).


## Loading Libraries

To kick things off, let's load our essential libraries.

In [207]:
# We use canonical naming conventions for our libraries and submodules:

import scipy as sp
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from statsmodels.formula.api import ols, logit

# An experimental submodule that brings the "grammar of graphics" into seaborn:

import seaborn.objects as so

### Note

If you're using [Google Colab](https://colab.google/), you can "mount" your Google Drive folders onto a Colab session to save plots, data sets and so on. To programmatically mount your Drive folder(s), run the following lines:

In [None]:
from google.colab import drive
drive.mount('/drive')

# Loading Data

To read in different _kinds_ of input data, we will use functions from the `pandas` library. For today's session, we will largely work with a dataset that we already encountered during the [PopAgingDataViz](https://popagingdataviz.com/) workshop: [gapminder](https://jennybc.github.io/gapminder/).






### Note

If you're using Jupyter _locally_, you will want to:

1. Clone/download our companion GitHub repository, [`intro.python.24`](https://github.com/sakeefkarim/intro.python.24).

2. _Change your working directory_ (via the command line) so that it points to the `intro.python.24` folder you just cloned/downloaded.


To implement Step 2, feel free to uncomment the code snippet below and run the cell, but please **remember to adjust the code as needed** (i.e., to reflect where `intro.python.24` is located on your machine).

In [None]:
# cd "THE PATH TO ... /intro.python.24"

### Excel and CSV Files

In [None]:
# From the companion GitHub repository:

gapminder_excel = pd.read_excel("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.xlsx")

gapminder_csv = pd.read_csv("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.csv")

# If you have the intro.python.24 folder on your machine:

# gapminder_excel = pd.read_excel("data/gapminder.xlsx")

# gapminder_csv = pd.read_csv("data/gapminder.csv")
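
The `pandas` readers accept URLs, local paths and file-like objects alike. As a minimal sketch, the cell below reads a tiny, made-up CSV from memory (the values are illustrative, not the real gapminder file) and uses the `usecols` parameter to keep only a few columns:

```python
import io

import pandas as pd

# A tiny, made-up CSV held in memory (illustrative values only):
toy_csv = io.StringIO(
    "country,continent,year,lifeExp\n"
    "A,Asia,2002,70.1\n"
    "B,Europe,2002,78.4\n"
    "C,Africa,2002,55.0\n"
)

# usecols restricts the columns that are read in:
toy = pd.read_csv(toy_csv, usecols=["country", "year", "lifeExp"])

print(toy.shape)
print(toy.columns.tolist())
```

The same `usecols` argument should work with the gapminder URL or a local path in place of `toy_csv`.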


### Stata Files

In [None]:
# From the companion GitHub repository:

gapminder_dta = pd.read_stata("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.dta")

# If you have the intro.python.24 folder on your machine:

# gapminder_dta = pd.read_stata("data/gapminder.dta")


### SPSS Files

In [None]:
# If you want to read in SPSS files, you may want to launch this notebook locally.

# Then, make sure you have pyreadstat installed within your conda environment, say, by running

# conda install -c conda-forge pyreadstat

# in your terminal. You can then run the code below:

# gapminder_spss = pd.read_spss("data/gapminder.sav")


### R Files

In [None]:
# In Google Colab:

!pip install pyreadr

# Locally, install pyreadr within your conda environment, e.g.:

# conda install -c conda-forge pyreadr

In [None]:
# Here, the alias is idiosyncratic (no canonical conventions for pyreadr)

import pyreadr as pyr

# Loading R files from GitHub

rds_url = "https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.rds"

destination = '/drive/My Drive/Colab/gapminder.rds'

pyr.download_file(rds_url, destination)

gapminder_r = pyr.read_r(destination)

# Checking to see which objects are available:

print(gapminder_r.keys())

# Only one key (None), ergo:

gapminder_rds = gapminder_r[None]

# Importing R files locally:

# gapminder_r = pyr.read_r("data/gapminder.rds")

# gapminder_rds = gapminder_r[None]

# Exploring the Data

### Basic Code

In [None]:
# For simplicity's sake, let's call our data frame of interest "gapminder"

gapminder = gapminder_excel

# Now, we'll use the "head" method to get more information

gapminder.head()

#?gapminder.head

How can we look at the first **ten** observations in `gapminder`?

In [None]:
# What if we want to take a peek at the last few observations in our dataset?

gapminder.tail()
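
Both `head` and `tail` take an optional number of rows (five by default). A quick sketch on a made-up, 20-row data frame (`toy` is a stand-in for `gapminder`):

```python
import pandas as pd

# A made-up data frame with 20 rows, labelled 0 through 19:
toy = pd.DataFrame({"x": range(20)})

first_ten = toy.head(10)   # rows 0 through 9
last_three = toy.tail(3)   # rows 17 through 19

print(first_ten.shape)
print(last_three.index.tolist())
```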

In [None]:
# Dimensions of our data frame:

gapminder.shape

# List the columns/variables we have at our disposal:

gapminder.columns

# What *type* of variable is "continent"?

gapminder['continent'].dtype

# Let's look at all the "dtypes":

gapminder.dtypes

### Exploratory Visualizations

In [None]:
# This unlocks seaborn's basic 'dark grid' theme:

sns.set_theme()

# Other seaborn themes: http://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles

sns.pairplot(gapminder)

# The 'hue' parameter (for most seaborn functions) allows analysts to condition on a
# variable of interest:

sns.pairplot(gapminder, hue = 'continent')

# Descriptive Statistics

In [269]:
# Basic Descriptives

gapminder.describe()

# Include non-numeric variables:

gapminder.describe(include='all')

# ONLY include non-numeric variables:

gapminder.describe(exclude=[np.number])

# OR:

gapminder.describe(include=['object'])

# Calculate the relative frequency (proportion) of each continent in the data set:

gapminder['continent'].value_counts(normalize=True)
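
To make the role of `normalize` concrete, here is a small sketch on a made-up series of continent labels: without `normalize`, `value_counts` returns raw counts; with `normalize=True`, it returns proportions that sum to one.

```python
import pandas as pd

# Made-up labels standing in for gapminder['continent']:
continents = pd.Series(["Asia", "Asia", "Europe", "Africa"], name="continent")

counts = continents.value_counts()                 # raw counts
shares = continents.value_counts(normalize=True)   # proportions

print(counts["Asia"], shares["Asia"])
```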


In [None]:
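# Mean of numeric variables for first 50 rows. In pandas 2.0+, calling .mean()
# on a data frame that includes string columns (country, continent) raises a
# TypeError, so we restrict the calculation to numeric variables:

gapminder.head(50).mean(numeric_only = True)

# Other ways of zeroing-in on numeric variables: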

gapminder.select_dtypes(exclude=['object']).head(50).mean()

gapminder.select_dtypes(include=[np.number]).head(50).mean()

# We can also store the subsetted data frame as an object before applying the mean method:

gapminder_50 = gapminder.select_dtypes(exclude=['object']).head(50)

gapminder_50.mean()

# Are these statistics informative at all? Hint: they're not! But why?

# We'll produce more informative descriptives below — after we've touched on
# filtering and reshaping.
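
As a rough sketch of the two routes used above, the made-up data frame below (hypothetical values) shows that `select_dtypes` drops the string column up front, while `numeric_only=True` skips it at the moment the mean is computed. Both arrive at the same numbers.

```python
import numpy as np
import pandas as pd

# Made-up data with one string column and two numeric columns:
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2002, 2007],
    "lifeExp": [60.5, 70.5],
})

# Route 1: keep numeric columns only, then average:
numeric = toy.select_dtypes(include=[np.number])
means_1 = numeric.mean()

# Route 2: ask mean() to ignore non-numeric columns:
means_2 = toy.mean(numeric_only=True)

print(numeric.columns.tolist())
print(means_1.equals(means_2))
```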

In [None]:
# Group data by continent:

gapminder_continent = gapminder.groupby(['continent'])

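# numeric_only=True skips the string column (country), which would otherwise
# raise a TypeError in pandas 2.0+:

gapminder_continent.mean(numeric_only=True)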

# Grouped averages for life expectancy:

gapminder_continent[['lifeExp']].mean()

# Grouped averages for life expectancy (descending):

gapminder_continent[['lifeExp']].mean().sort_values(by = 'lifeExp', ascending=False)
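
Grouped objects can also return several statistics at once via `agg`. A minimal sketch on made-up data (the column names mirror gapminder, but the values are invented):

```python
import pandas as pd

# Made-up data: two continents, two countries each:
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "country": ["A", "B", "C", "D"],
    "lifeExp": [60.0, 70.0, 75.0, 85.0],
})

# Mean, minimum and maximum life expectancy per continent,
# sorted by the mean in descending order:
summary = (
    toy.groupby("continent")["lifeExp"]
       .agg(["mean", "min", "max"])
       .sort_values(by="mean", ascending=False)
)

print(summary)
```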

# Filtering Observations, Column Selection

In [None]:
# Select column — life expectancy:

gapminder['lifeExp']

gapminder.lifeExp

gapminder[['lifeExp']]

# Selecting three columns: continent, country, life expectancy

gapminder[['continent', 'country', 'lifeExp']]

# What *kind* of variables are we dealing with?

gapminder.dtypes
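
The three selection styles above are not interchangeable. In the sketch below (made-up data), single brackets return a `Series`, double brackets return a one-column `DataFrame`, and attribute access quietly fails for a column such as `pop`, because `pop` is also the name of a `DataFrame` method.

```python
import pandas as pd

# Made-up data with a column named "pop", as in gapminder:
toy = pd.DataFrame({"lifeExp": [60.0, 70.0], "pop": [10, 30]})

as_series = toy["lifeExp"]     # single brackets -> Series
as_frame = toy[["lifeExp"]]    # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# toy.pop is the DataFrame.pop method, not the "pop" column:
print(callable(toy.pop))

# Bracket access always returns the column:
print(toy["pop"].tolist())
```

For that reason, bracket notation is usually the safer default.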

In [None]:
# Keep observations where life expectancy exceeds 80:

gapminder_select = gapminder[ gapminder['lifeExp'] > 80 ]

# Dimensions (original data frame)

print(gapminder.shape)

# Dimensions (subsetted data frame)

print(gapminder_select.shape)

In [None]:
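# Keep observations from Asia:

gapminder_asia = gapminder[ gapminder['continent'] == 'Asia']

# The filtered data frame retains the original row labels:

gapminder_asia.head()

# reset_index renumbers the rows from zero; with drop=False, the old labels are
# kept as a new "index" column:

gapminder_asia.reset_index(drop=False).head()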
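
Filters can be combined. A short sketch with made-up data: each condition goes in its own parentheses, `&` means "and", and `|` means "or". It also shows what `reset_index(drop=True)` does to the row labels after filtering.

```python
import pandas as pd

# Made-up data (column names mirror gapminder):
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [82.0, 81.0, 65.0, 50.0],
})

# Both conditions must hold:
asia_high = toy[(toy["continent"] == "Asia") & (toy["lifeExp"] > 80)]

# Either condition may hold:
either = toy[(toy["continent"] == "Africa") | (toy["lifeExp"] > 80)]

# Filtering keeps the original row labels ...
print(either.index.tolist())

# ... whereas reset_index(drop=True) renumbers the rows from zero:
print(either.reset_index(drop=True).index.tolist())
```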

In [None]:
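# Subsetting rows based on index *label* (the endpoint is included, so this
# returns 11 rows: labels 0 through 10):

gapminder.loc[0:10]

# Subsetting rows based on *integer position* (the endpoint is excluded, so this
# returns 10 rows: positions 0 through 9):

gapminder.iloc[0:10]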
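
The difference between labels and positions matters most once the rows are no longer in their original order. A minimal sketch on a made-up data frame:

```python
import pandas as pd

# Made-up data: 20 rows labelled 0 through 19, with x running from 100 to 119:
toy = pd.DataFrame({"x": range(100, 120)})

by_label = toy.loc[0:10]       # labels 0 through 10 (endpoint included)
by_position = toy.iloc[0:10]   # positions 0 through 9 (endpoint excluded)

print(len(by_label), len(by_position))

# After sorting, labels travel with their rows, but positions do not:
shuffled = toy.sort_values("x", ascending=False)

print(shuffled.iloc[0, 0])     # first row by position: the largest x
print(shuffled.loc[0, "x"])    # row labelled 0: still the smallest x
```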

In [None]:
# First row, first column — pursuant to row/column position

gapminder.iloc[0, 0]

# Subset data frame based on labels:

gapminder.loc[0, 'country']

# Produce data frame based on row/column labels:

gapminder.loc[0:, ['country']]

# Produce data frame based on row/column positions:

gapminder.iloc[0:, [0]]

In [149]:
# Last row:

gapminder.iloc[-1]

# Last column:

gapminder.iloc[:, [-1]]

# First and sixth rows + second and fourth columns:

gapminder.iloc[[0, 5], [1, 3]]

  continent  lifeExp
0      Asia   28.801
5      Asia   38.438


# Cleaning (Recoding, New Variables *etc.*)

In [281]:
# Let's zero-in on the latest year in the gapminder data set:

gapminder_07 = gapminder.query("year == 2007 and continent != 'Oceania'").reset_index(drop=True)

# The year variable's no longer necessary! Let's drop it:

gapminder_07.drop(columns='year',
                  # The inplace parameter executes the command "quietly" — i.e.,
                  # no output is displayed and the results do not have to be stored in a
                  # new object:
                  inplace = True)

# Let's rename some columns

gapminder_07.rename(columns={# Existing column name : new column name
                   'gdpPercap':'gdp_pc',
                   'lifeExp':'le'}, inplace=True)

# For illustrative purposes, let's generate new variables that take the log of
# per capita GDP/population:

gapminder_07['ln_gdp'] = np.log(gapminder_07['gdp_pc'])

gapminder_07['ln_pop'] = np.log(gapminder_07['pop'])

# Discretizing population size and per capita GDP into new quintile measures:

gapminder_07['pop_quintile'] = pd.qcut(gapminder_07['pop'], q = 5, labels=False) + 1

gapminder_07['gdp_quintile'] = pd.qcut(gapminder_07['gdp_pc'], q = 5, labels=False) + 1

# Using quintile measures to create crude binary indicators

gapminder_07['size'] = np.where(gapminder_07['pop_quintile'] > 4, 'Large', 'Small or Medium')

# gapminder_07['size'] = gapminder_07['size'].map({'Large': 1, 'Small or Medium': 0})

gapminder_07['wealth'] = np.where(gapminder_07['gdp_quintile'] > 3, 'Top 2 Quintiles', 'Bottom 3 Quintiles')

# Transforming quintile measures into categorical variables

gapminder_07[['pop_quintile', 'gdp_quintile']] = gapminder_07[['pop_quintile', 'gdp_quintile']].astype('category')
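
To see what `pd.qcut` and `np.where` are doing in the cell above, here is a small sketch on ten made-up population values. With `labels=False`, `qcut` returns integer codes starting at zero, which is why one is added to obtain quintiles numbered 1 through 5.

```python
import numpy as np
import pandas as pd

# Ten made-up population values:
toy = pd.DataFrame({"pop": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Five equally sized bins; codes run 0-4, so add 1 to get 1-5:
toy["pop_quintile"] = pd.qcut(toy["pop"], q=5, labels=False) + 1

# Top quintile -> 'Large'; everything else -> 'Small or Medium':
toy["size"] = np.where(toy["pop_quintile"] > 4, "Large", "Small or Medium")

print(toy["pop_quintile"].tolist())
print(toy["size"].value_counts())
```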


# Fitting Crude Models

These models are for illustrative purposes only, so please don't read too much into the results!

In [227]:
mod_gapminder_1 = ols("le ~ ln_pop + ln_gdp + continent", data=gapminder_07).fit()

# Text

mod_gapminder_1.summary()

# LaTeX Code

mod_gapminder_1.summary().as_latex()

# CSV

mod_gapminder_1.summary().as_csv()

# Zooming in on parameters

print(mod_gapminder_1.summary().tables[1])

# Save results as an Excel file

mod1_results = mod_gapminder_1.summary().tables[1]

pd.DataFrame(mod1_results).to_excel('/drive/My Drive/Colab/mod1_results.xlsx')


In [252]:
mod_gapminder_2 = ols("le ~ ln_pop + ln_gdp * continent", data=gapminder_07).fit()

# Zooming in on parameters

print(mod_gapminder_2.summary().tables[1])

                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                       21.4609      9.127      2.351      0.020       3.405      39.516
continent[T.Americas]           10.6983     15.932      0.671      0.503     -20.820      42.217
continent[T.Asia]                2.4524      9.933      0.247      0.805     -17.196      22.101
continent[T.Europe]             12.8151     20.037      0.640      0.524     -26.824      52.454
ln_pop                           0.0795      0.366      0.217      0.828      -0.644       0.803
ln_gdp                           4.2857      0.835      5.135      0.000       2.635       5.937
ln_gdp:continent[T.Americas]     0.1645      1.825      0.090      0.928      -3.445       3.774
ln_gdp:continent[T.Asia]         0.9151      1.197      0.764      0.446      -1.454       3.284
ln_gdp:continent[T.Europe]    
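
In a model with an interaction, the `ln_gdp` coefficient is the slope for the reference continent (Africa here, since it is the category missing from the table), and each `ln_gdp:continent[...]` term is the *difference* in slope for that continent. The sketch below simply adds up the point estimates printed above. The Europe row is cut off in the output, so it is left out, and the standard errors of these sums are ignored.

```python
# Point estimates copied from the coefficient table above:
base_slope = 4.2857           # ln_gdp (slope for the reference continent)

interactions = {
    "Americas": 0.1645,       # ln_gdp:continent[T.Americas]
    "Asia": 0.9151,           # ln_gdp:continent[T.Asia]
}

# Continent-specific slope = base slope + interaction term:
slopes = {"Africa": base_slope}

for continent, shift in interactions.items():
    slopes[continent] = round(base_slope + shift, 4)

print(slopes)
```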

In [297]:
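# Recode size as a binary (0/1) outcome. Run this line only once: mapping the
# already-recoded column a second time would turn every value into NaN.

gapminder_07['size'] = gapminder_07['size'].map({'Large': 1, 'Small or Medium': 0})

# Fit a logistic regression of size on logged per capita GDP and life expectancy:

mod_gapminder_3 = logit("size ~ ln_gdp + le", data=gapminder_07).fit()

# Zooming in on parameters

print(mod_gapminder_3.summary().tables[1])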

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.7163      1.470     -1.848      0.065      -5.598       0.165
ln_gdp        -0.2046      0.290     -0.706      0.480      -0.773       0.364
le             0.0456      0.035      1.289      0.197      -0.024       0.115
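
Logit coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which many readers find easier to interpret. A minimal sketch using the point estimates printed above (typed in by hand here, so the cell runs on its own):

```python
import numpy as np
import pandas as pd

# Point estimates copied from the coefficient table above (log-odds scale):
log_odds = pd.Series({"Intercept": -2.7163, "ln_gdp": -0.2046, "le": 0.0456})

# Exponentiate to move from log-odds to odds ratios:
odds_ratios = np.exp(log_odds)

print(odds_ratios.round(3))
```

With the fitted model in memory, `np.exp(mod_gapminder_3.params)` should yield the same quantities without any retyping.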


# Exercises