# Describing the same data with summary statistics

### Python and R Setup

This setup allows you to use *Python* and *R* in the same notebook.

To set up a similar notebook, see quickstart instructions here:

https://github.com/dmil/jupyter-quickstart



In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


### Import packages in R

In [6]:
%%R

# library() stops with an error if the package is missing; require() only warns
library(tidyverse)


### Read data

In [7]:
%%R

# Read data
df <- read_csv('housing_data.csv')

Rows: 189 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): borough
dbl (11): zip, population, pct_hispanic_or_latino, pct_asian, pct_american_i...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [8]:
%%R

df

# A tibble: 189 × 12
     zip popul…¹ borough pct_h…² pct_a…³ pct_a…⁴ pct_b…⁵ pct_w…⁶ pct_n…⁷ pct_s…⁸
   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 11368  112088 QUEENS    76.5    11.6     0.09    7.12    3.73    0       0.37
 2 11385  107796 QUEENS    45.0     6.46    0.03    1.77   45.3     0       0.32
 3 11211  103123 BROOKL…   24.1     5.53    0.19    3.56   64.2     0.02    0.29
 4 11208  101313 BROOKL…   40.6     6.01    0.03   48.9     2.71    0       0.76
 5 10467  101255 BRONX     52.1     5.68    0.35   30.4     9.24    0.02    0.76
 6 11236  100844 BROOKL…    7.66    2.58    0.09   84.0     4.13    0       0.44
 7 11226   99558 BROOKL…   16.4     3.07    0.15   64.8    13.0     0       0.71
 8 11373   94437 QUEENS    41.7    48.5     0.34    1.62    6.18    0.01    0.5 
 9 11234   93534 BROOKL…   11.1     7.19    0.13   42.9    37.2     0       0.26
10 11220   93170 BROOKL…   41.3    41.4     0.07    1.9    13.9     0       0.25
# … wit

### R syntax - getting a column
In R, the `$` operator lets you grab a single column of a dataframe.

In Python (pandas), the equivalent would be something like `df["pct_below_poverty"]`.
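
For comparison, here is a minimal pandas sketch of the same column access. The small dataframe is only a stand-in: it uses the first five rows and the first five poverty values printed in this notebook, assuming both are printed in the same row order, and it spells the truncated borough name the way the grouped summary later in the notebook prints it.

```python
import pandas as pd

# Stand-in for the full dataset: first five printed rows only (assumed row order).
df = pd.DataFrame({
    "zip": [11368, 11385, 11211, 11208, 10467],
    "borough": ["QUEENS", "QUEENS", "BROOKLYN", "BROOKLYN", "BRONX"],
    "pct_below_poverty": [19.69, 10.68, 25.22, 25.68, 25.20],
})

# Bracket access returns the column as a pandas Series, much like df$col in R.
poverty = df["pct_below_poverty"]

# Attribute access also works when the column name is a valid Python identifier.
same_column = df.pct_below_poverty

print(poverty.tolist())
```

Bracket access is usually the safer habit, since it also works for column names that contain spaces or clash with dataframe methods.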

In [9]:
%%R

# Get a column
df$pct_below_poverty

  [1] 19.69 10.68 25.22 25.68 25.20 12.53 15.74 13.10  7.51 27.25 37.16 16.29
 [13] 14.27 28.50  9.13 34.96 34.92 34.92 19.92 34.02 10.81 21.34 13.64 37.77
 [25] 18.65 21.76 27.14 29.75 17.12 32.86 20.80 14.55 19.09 34.41 36.23 17.82
 [37] 37.14 27.36  8.43 18.51 15.60 15.60 16.71 16.71  7.09 19.42 31.16 10.97
 [49] 22.69 25.27 11.76  9.46 14.83 12.35 20.76 12.79  7.96  5.88 13.45 21.97
 [61] 26.48 35.61  6.01  6.01 17.62 23.54  7.40 16.98 17.25 11.89  9.20  9.24
 [73] 19.84  8.20  9.92 15.38 29.09 33.55 11.51 36.13 10.76 27.67  5.30 28.12
 [85] 12.80 21.04 12.24 12.07  5.21 12.73 13.69 11.31 12.32 18.21 38.14 13.26
 [97]  6.77  3.15 10.37 14.53 21.56  5.32 23.14  8.72 15.03 18.67  7.53 14.71
[109]  8.18 12.63 43.12 16.36  9.33 10.18  9.27  7.39 36.77 11.51 10.82  6.63
[121]  6.37 11.56 15.20 16.85 13.37  3.98 27.58 27.58  5.79  9.20 16.08  5.88
[133]  4.36 10.73 14.33  9.68  3.17 25.47  8.03 23.01  8.07 14.11 20.15 10.14
[145]  8.54 13.06 16.41  9.90 11.08 18.57  9.25  2.45 25.60 11.1

### R syntax - the almighty pipe `%>%`

The pipe (`%>%`) takes the value of the expression on its left and passes it as the first argument to the function on its right.

https://towardsdatascience.com/an-introduction-to-the-pipe-in-r-823090760d64


In [10]:
%%R 

df$pct_below_poverty %>% 
    mean()

# Equivalent to...
mean(df$pct_below_poverty)

[1] 15.87868
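
If you think in Python, pandas has a `.pipe()` method that plays a loosely similar role. The sketch below is an illustration rather than a one-to-one translation: it uses only the first ten printed values of `pct_below_poverty`, and `as_fraction` is a hypothetical helper made up for this example.

```python
import numpy as np
import pandas as pd

# First ten printed values of pct_below_poverty (a sample, not the full column).
poverty = pd.Series(
    [19.69, 10.68, 25.22, 25.68, 25.20, 12.53, 15.74, 13.10, 7.51, 27.25],
    name="pct_below_poverty",
)

def as_fraction(percentages):
    """Hypothetical helper: turn percentages (0-100) into fractions (0-1)."""
    return percentages / 100

# Nested calls are read inside-out.
nested = np.mean(as_fraction(poverty))

# .pipe(f) calls f(poverty), so the steps read left to right like %>% in R.
piped = poverty.pipe(as_fraction).pipe(np.mean)

print(nested, piped)
```

Both versions compute the same number; the piped form just puts the steps in the order they happen.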


In [11]:
%%R

df$pct_below_poverty %>% 
    median()

[1] 13.06


In [12]:
%%R 

df$pct_below_poverty %>% 
    var()

[1] 108.451


In [13]:
%%R 

df$pct_below_poverty %>% 
    sd()

[1] 10.41398
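
The two outputs above are linked: the standard deviation is the square root of the variance (10.41398 squared is roughly 108.451). R's `var()` and `sd()` divide by n - 1 (the sample versions). The sketch below, using only the first ten printed values as a sample, shows that pandas defaults to the same convention while NumPy defaults to dividing by n.

```python
import numpy as np
import pandas as pd

# First ten printed values of pct_below_poverty (a sample, not the full column).
poverty = pd.Series(
    [19.69, 10.68, 25.22, 25.68, 25.20, 12.53, 15.74, 13.10, 7.51, 27.25]
)
n = len(poverty)

# pandas defaults to ddof=1 (divide by n - 1), matching R's var() and sd().
sample_var = poverty.var()
sample_sd = poverty.std()

# NumPy defaults to ddof=0 (divide by n), the population variance.
population_var = np.var(poverty.to_numpy())

# The standard deviation is the square root of the variance.
print(np.isclose(sample_sd, np.sqrt(sample_var)))

# The two variance conventions differ by a factor of n / (n - 1).
print(np.isclose(population_var * n / (n - 1), sample_var))
```

With only a handful of values the two conventions differ noticeably, so it is worth checking which one a library uses before comparing numbers across R and Python.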


In [14]:
%%R 

df %>% 
    group_by(borough) %>%
    summarize(
        mean=mean(pct_below_poverty), 
        median=median(pct_below_poverty), 
        standard_deviation=sd(pct_below_poverty))

# A tibble: 5 × 4
  borough        mean median standard_deviation
  <chr>         <dbl>  <dbl>              <dbl>
1 BRONX          26.7  28.7               11.6 
2 BROOKLYN       18.6  17.2                8.20
3 MANHATTAN      13.8  11.0                9.41
4 QUEENS         12.0  10.7                9.06
5 STATEN ISLAND  12.0   9.24               6.58
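
A rough pandas counterpart to `group_by()` plus `summarize()` is `groupby()` plus `agg()`. This sketch runs on a ten-row stand-in (the first ten printed rows and poverty values, assuming they share the same row order), so its numbers will not match the full-dataset table above.

```python
import pandas as pd

# Ten-row stand-in for the full dataset (assumed row order, see note above).
df = pd.DataFrame({
    "borough": ["QUEENS", "QUEENS", "BROOKLYN", "BROOKLYN", "BRONX",
                "BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN", "BROOKLYN"],
    "pct_below_poverty": [19.69, 10.68, 25.22, 25.68, 25.20,
                          12.53, 15.74, 13.10, 7.51, 27.25],
})

# Named aggregation: each keyword becomes an output column, as in summarize().
summary = (
    df.groupby("borough")["pct_below_poverty"]
      .agg(mean="mean", median="median", standard_deviation="std")
      .reset_index()
)

# BRONX has a single row in this sample, so its standard deviation is NaN.
print(summary)
```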


**👉 Try It**
Compare the summary statistics to the distributions in your previous assignment. What story do they tell? What stories do they obscure? Why was it important to plot the data in the case of this dataset? What did you gain from plotting the `pct_below_poverty` distribution in various different ways?

> I think in general, without plotting, we miss the density of the distribution. For example, while Manhattan's median poverty rate is 11%, we see from the charts that there are a number of neighborhoods with poverty rates around or above 20%. Similarly, in the Bronx, which has the highest median poverty rate, there were a number of neighborhoods with poverty rates higher than 30% too.

> Without plotting the charts, we wouldn't be able to spot outliers either, such as the more than 60% poverty rate in Jamaica, Queens (which otherwise looks like a borough with a relatively low poverty rate) and Mott Haven and Hunts Point in the Bronx.

> However, I think with these summary statistics, it's easier to see that the Bronx has a wider spread of poor and non-poor neighborhoods, looking at the standard deviation.

> I think the density plot and violin plot were especially helpful in seeing the spread of the data, and not just the peak (would the peak be the median?), while the box-and-whisker plot helps you see the lower and upper quartiles, which could help audiences compare something like wealth or wage data to see where they lie.
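
One note on the question above: the peak of a density plot marks the mode (the most common value), which need not equal the median unless the distribution is roughly symmetric with a single peak. The quartiles that a box-and-whisker plot draws can also be computed directly. This is a small sketch on the first ten printed values only, so the numbers describe that sample rather than the full dataset.

```python
import pandas as pd

# First ten printed values of pct_below_poverty (a sample, not the full column).
poverty = pd.Series(
    [19.69, 10.68, 25.22, 25.68, 25.20, 12.53, 15.74, 13.10, 7.51, 27.25]
)

# Lower quartile, median, and upper quartile (pandas interpolates linearly).
q1, median, q3 = poverty.quantile([0.25, 0.5, 0.75])

# The interquartile range is the height of the box in a box plot.
iqr = q3 - q1

print(q1, median, q3, iqr)
```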