# Describing the same data with summary statistics

### Python and R Setup

This setup allows you to use *Python* and *R* in the same notebook.

To set up a similar notebook, see quickstart instructions here:

https://github.com/dmil/jupyter-quickstart



In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


### Import packages in R

In [3]:
%%R

require('tidyverse')


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: tidyverse


### Read data

In [4]:
%%R

# Read data
df <- read_csv('housing_data.csv')

Rows: 189 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): borough
dbl (11): zip, population, pct_hispanic_or_latino, pct_asian, pct_american_i...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


### R syntax - getting a column
In R, the `$` operator extracts a single column from a data frame.

In Python (pandas), the equivalent would be something like `df["pct_below_poverty"]`.

In [5]:
%%R

# Get a column
df$pct_below_poverty

  [1] 19.69 10.68 25.22 25.68 25.20 12.53 15.74 13.10  7.51 27.25 37.16 16.29
 [13] 14.27 28.50  9.13 34.96 34.92 34.92 19.92 34.02 10.81 21.34 13.64 37.77
 [25] 18.65 21.76 27.14 29.75 17.12 32.86 20.80 14.55 19.09 34.41 36.23 17.82
 [37] 37.14 27.36  8.43 18.51 15.60 15.60 16.71 16.71  7.09 19.42 31.16 10.97
 [49] 22.69 25.27 11.76  9.46 14.83 12.35 20.76 12.79  7.96  5.88 13.45 21.97
 [61] 26.48 35.61  6.01  6.01 17.62 23.54  7.40 16.98 17.25 11.89  9.20  9.24
 [73] 19.84  8.20  9.92 15.38 29.09 33.55 11.51 36.13 10.76 27.67  5.30 28.12
 [85] 12.80 21.04 12.24 12.07  5.21 12.73 13.69 11.31 12.32 18.21 38.14 13.26
 [97]  6.77  3.15 10.37 14.53 21.56  5.32 23.14  8.72 15.03 18.67  7.53 14.71
[109]  8.18 12.63 43.12 16.36  9.33 10.18  9.27  7.39 36.77 11.51 10.82  6.63
[121]  6.37 11.56 15.20 16.85 13.37  3.98 27.58 27.58  5.79  9.20 16.08  5.88
[133]  4.36 10.73 14.33  9.68  3.17 25.47  8.03 23.01  8.07 14.11 20.15 10.14
[145]  8.54 13.06 16.41  9.90 11.08 18.57  9.25  2.45 25.60 11.1
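
For comparison, here is a minimal pandas sketch of the same column lookup. It does not read `housing_data.csv`; the inline CSV is a hypothetical three-row stand-in whose `pct_below_poverty` values are the first three printed above, and the `row_id` column and the `df_py` name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for housing_data.csv (row_id is invented;
# the poverty values are the first three from the R output above).
csv_text = """row_id,pct_below_poverty
1,19.69
2,10.68
3,25.22
"""

df_py = pd.read_csv(io.StringIO(csv_text))

# Bracket indexing returns the column as a pandas Series,
# which plays the same role as df$pct_below_poverty in R.
poverty = df_py["pct_below_poverty"]

print(poverty.tolist())
```

With the real file, `pd.read_csv('housing_data.csv')` would presumably take the place of the `io.StringIO` stand-in.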

### R syntax - the almighty pipe `%>%`

The pipe (`%>%`) takes the result of the expression on its left and passes it as the first argument to the function on its right.

https://towardsdatascience.com/an-introduction-to-the-pipe-in-r-823090760d64


In [6]:
%%R 

df$pct_below_poverty %>% 
    mean()

# Equivalent to...
mean(df$pct_below_poverty)

[1] 15.87868
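
Python has no built-in pipe operator, but the idea can be sketched with a small helper. The `pipe` function below is a hypothetical illustration (not part of pandas or the standard library) that feeds a value through a chain of functions from left to right, which is roughly what `x %>% f() %>% g()` does in R.

```python
from functools import reduce
from statistics import mean


def pipe(value, *funcs):
    """Pass value through each function in turn, left to right."""
    return reduce(lambda acc, func: func(acc), funcs, value)


def round_1(x):
    """Round to one decimal place."""
    return round(x, 1)


# First three poverty values from the column printed earlier.
values = [19.69, 10.68, 25.22]

piped = pipe(values, mean, round_1)   # reads like: values %>% mean() %>% round_1()
nested = round_1(mean(values))        # the same computation, written inside-out

print(piped == nested)
```

The piped form reads in the order the steps happen, while the nested form has to be read from the innermost call outward.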


In [7]:
%%R

df$pct_below_poverty %>% 
    median()

[1] 13.06


In [8]:
%%R 

df$pct_below_poverty %>% 
    var()

[1] 108.451


In [9]:
%%R 

df$pct_below_poverty %>% 
    sd()

[1] 10.41398
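
The variance and standard deviation above are two views of the same quantity: R's `var()` computes the sample variance (dividing by n - 1), and `sd()` is its square root. The sketch below checks that relationship with Python's standard `statistics` module on a three-value sample taken from the start of the column, then confirms that the two numbers reported above are consistent with each other.

```python
import math
import statistics

# First three poverty values from the column printed earlier.
values = [19.69, 10.68, 25.22]

n = len(values)
sample_mean = sum(values) / n

# Sample variance by hand: squared deviations divided by (n - 1).
manual_var = sum((x - sample_mean) ** 2 for x in values) / (n - 1)

# The statistics module uses the same (n - 1) definition.
sample_var = statistics.variance(values)
sample_sd = statistics.stdev(values)

print(math.isclose(manual_var, sample_var))
print(math.isclose(sample_sd, math.sqrt(sample_var)))

# The values R reported for the full column agree: 10.41398 squared is about 108.451.
print(math.isclose(10.41398 ** 2, 108.451, rel_tol=1e-4))
```

Because the standard deviation is in the same units as the data (percentage points here), it is usually the easier of the two to interpret.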


In [10]:
%%R 

df %>% 
    group_by(borough) %>%
    summarize(
        mean=mean(pct_below_poverty), 
        median=median(pct_below_poverty), 
        standard_deviation=sd(pct_below_poverty))

# A tibble: 5 × 4
  borough        mean median standard_deviation
  <chr>         <dbl>  <dbl>              <dbl>
1 BRONX          26.7  28.7               11.6 
2 BROOKLYN       18.6  17.2                8.20
3 MANHATTAN      13.8  11.0                9.41
4 QUEENS         12.0  10.7                9.06
5 STATEN ISLAND  12.0   9.24               6.58
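
A rough pandas counterpart to the `group_by()` / `summarize()` chain is `groupby()` followed by `agg()` with named aggregations. The sketch below uses an invented toy table (boroughs "A" and "B" with made-up values), not the real housing data, so the numbers it produces say nothing about New York.

```python
import pandas as pd

# Invented toy rows for illustration only, NOT the real housing data.
toy = pd.DataFrame({
    "borough": ["A", "A", "A", "B", "B"],
    "pct_below_poverty": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One output row per borough, one output column per named aggregation.
summary = (
    toy.groupby("borough")["pct_below_poverty"]
       .agg(mean="mean", median="median", standard_deviation="std")
       .reset_index()
)

print(summary)
```

Like R's `sd()`, the pandas `"std"` aggregation uses the sample (n - 1) definition by default, so the two approaches should agree on the same data.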


**👉 Try It**
Compare the summary statistics to the distributions in your previous assignment. What story do they tell? What stories do they obscure? Why was it important to plot the data in the case of this dataset? What did you gain from plotting the `pct_below_poverty` distribution in several different ways?

> Personally, I see the most interesting story in this data in the differences within the boroughs. I think the most eye-opening moment for me was seeing the Distribution of Population Below Poverty Line by Borough as a dot plot. It is literally serving stories for me! 

> At the moment, I'm most interested in one zip code in Queens that has a population with a significantly higher poverty rate than anywhere else in Queens or NYC in general. What is happening there? After plotting this data, I could easily write an analysis on some kind of "geography of New York poverty" and describe the differences between areas.

> New York is a city that has spread public housing across all boroughs, which might be one factor explaining why there are variations within certain areas. However, there are still differences between neighborhoods—what are they?

> But as I already said, I'm still most interested in this one neighborhood in Queens. So if I were to start working on a story based on this dataset, I would begin with a working title like: "This is the Poorest Place in New York City – What Explains the Extraordinary Poverty Rate in One Neighborhood of One of the Most Iconic Cities in the World?"

> I would go to this area, conduct interviews, and then mix my data analysis with the everyday experiences of New Yorkers living in Queens. And, of course, to have a broader understanding of the phenomenon, I would interview city officials and academics who have researched this issue. And I couldn't have gotten this far without playing with the dataset in various ways!