# Statistical Methods in Pandas - Lab

## Introduction

In this lab, you'll get hands-on experience with some of the key summary statistics methods in Pandas.

## Objectives
You will be able to:

* Understand and use the `df.describe()` and `df.info()` summary statistics methods
* Use built-in Pandas methods for calculating summary statistics (`.mean()`, `.median()`, `.std()`, `.var()`, `.count()`, `.sum()` and `.quantile()`)
* Apply a function to every element in a Series or DataFrame using `s.apply()` and `df.applymap()`


## Getting Started

For this lab, we'll be working with a dataset containing information on various LEGO sets.  You will find this dataset in the file `lego_sets.csv`.  

In the cell below:

* Import pandas and set the standard alias of `pd`
* Load in the `lego_sets.csv` dataset using the `read_csv()` function
* Display the head of the DataFrame to get a feel for what we'll be working with

In [1]:
import pandas as pd
df = pd.read_csv("lego_sets.csv")

In [2]:
df.head(2)

| | ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US |
| 1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US |


In [7]:
# Keep only the product ID and piece count columns, then preview the first 10 rows
df5 = df[['prod_id', 'piece_count']]
df5.head(10)

| | prod_id | piece_count |
|---|---|---|
| 0 | 75823.0 | 277.0 |
| 1 | 75822.0 | 168.0 |
| 2 | 75821.0 | 74.0 |
| 3 | 21030.0 | 1032.0 |
| 4 | 21035.0 | 744.0 |
| 5 | 21039.0 | 597.0 |
| 6 | 21028.0 | 598.0 |
| 7 | 21029.0 | 780.0 |
| 8 | 21034.0 | 468.0 |
| 9 | 21033.0 | 444.0 |


## Getting DataFrame-Level Statistics

We'll begin by getting some overall summary statistics on the dataset.  We'll get this information in two ways: `.info()` and `.describe()`.

### Using `.info()`

The `.info()` method provides us with metadata on the DataFrame itself.  This allows us to answer questions such as:

* What data type does each column contain?
* How many rows are in my dataset? 
* How many total non-missing values does each column contain?
* How much memory does the DataFrame take up?

In the cell below, call our DataFrame's `.info()` method. 

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages                 12261 non-null object
list_price           12261 non-null float64
num_reviews          10641 non-null float64
piece_count          12261 non-null float64
play_star_rating     10486 non-null float64
prod_desc            11884 non-null object
prod_id              12261 non-null float64
prod_long_desc       12261 non-null object
review_difficulty    10206 non-null object
set_name             12261 non-null object
star_rating          10641 non-null float64
theme_name           12258 non-null object
val_star_rating      10466 non-null float64
country              12261 non-null object
dtypes: float64(7), object(7)
memory usage: 1.3+ MB
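
The non-null counts printed by `.info()` can also be computed directly, which is handy when you want the number of missing values per column as data rather than as printed text. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from `lego_sets.csv`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab's DataFrame; the values are invented for illustration.
toy = pd.DataFrame({
    "list_price": [29.99, 19.99, 12.99, 99.99],
    "num_reviews": [2.0, np.nan, 11.0, np.nan],
    "review_difficulty": ["Average", "Easy", np.nan, "Easy"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of non-missing values in each column (the counts that .info() reports).
non_null_per_column = toy.notna().sum()
print(non_null_per_column)
```

For every column, the two counts add up to the number of rows, so either one tells you which columns contain missing values.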


#### Interpreting the Results

Read the output above, and then answer the following questions:

How many total rows are in this DataFrame?  How many columns contain numeric data? How many contain categorical data?  Identify at least 3 columns that contain missing values. 

Write your answer below this line:
________________________________________________________________________________________________________________________________



Answer question here...

### Using `.describe()`

Whereas `.info()` provides metadata about the DataFrame itself, `.describe()` returns basic summary statistics about the data contained within the DataFrame.  

In the cell below, call the DataFrame's `.describe()` method. 

In [9]:
df.describe()

| | list_price | num_reviews | piece_count | play_star_rating | prod_id | star_rating | val_star_rating |
|---|---|---|---|---|---|---|---|
| count | 12261.0 | 10641.0 | 12261.0 | 10486.0 | 12261.0 | 10641.0 | 10466.0 |
| mean | 65.141998 | 16.826238 | 493.405921 | 4.337641 | 59836.75 | 4.514134 | 4.22896 |
| std | 91.980429 | 36.368984 | 825.36458 | 0.652051 | 163811.5 | 0.518865 | 0.660282 |
| min | 2.2724 | 1.0 | 1.0 | 1.0 | 630.0 | 1.8 | 1.0 |
| 25% | 19.99 | 2.0 | 97.0 | 4.0 | 21034.0 | 4.3 | 4.0 |
| 50% | 36.5878 | 6.0 | 216.0 | 4.5 | 42069.0 | 4.7 | 4.3 |
| 75% | 70.1922 | 13.0 | 544.0 | 4.8 | 70922.0 | 5.0 | 4.7 |
| max | 1104.87 | 367.0 | 7541.0 | 5.0 | 2000431.0 | 5.0 | 5.0 |
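
As a rough illustration of how to read this kind of output, the sketch below runs `.describe()` on a tiny made-up DataFrame. The piece counts are the first five shown earlier in this lab, while the theme labels are assumptions added only for the example. It also shows that calling `.describe()` on a text column returns a different set of statistics.

```python
import pandas as pd

# Small illustrative frame; the theme labels are invented for this example.
toy = pd.DataFrame({
    "piece_count": [277.0, 168.0, 74.0, 1032.0, 744.0],
    "theme_name": ["Angry Birds", "Angry Birds", "Angry Birds",
                   "Architecture", "Architecture"],
})

# By default, describe() summarises only the numeric columns.
numeric_summary = toy.describe()
print(numeric_summary)

# The "50%" row is the median of the column.
median_from_describe = numeric_summary.loc["50%", "piece_count"]
print(median_from_describe == toy["piece_count"].median())

# On a text column, describe() reports count, unique, top and freq instead.
text_summary = toy["theme_name"].describe()
print(text_summary)
```

Individual statistics can be pulled out of the result with `.loc`, since `.describe()` returns an ordinary DataFrame (or Series, for a single column).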


#### Interpreting the Results

The output contains descriptive statistics corresponding to the columns.  Use these to answer the following questions:

What is the standard deviation of piece count?  How many pieces are in the largest LEGO set?  How many are in the smallest LEGO set? What is the median `val_star_rating`?

________________________________________________________________________________________________________________________________

Answer questions here...

## Getting Summary Statistics

Pandas also allows us to easily compute individual summary statistics using built-in methods.  Next, we'll get some practice using these methods. 

In the cell below, compute the median value of the `star_rating` column.

In [10]:
df["star_rating"].median()

4.7

Next, get a count of the total number of non-missing values in `play_star_rating`.

In [13]:
print(df.play_star_rating.nunique())
df["play_star_rating"].count()

30


10486
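
The two numbers above answer different questions: `.nunique()` counts distinct values, while `.count()` counts non-missing entries. A minimal sketch on a made-up Series (the ratings below are assumptions, not values from the dataset) makes the distinction concrete.

```python
import numpy as np
import pandas as pd

# Invented ratings with two missing entries and one repeated value.
ratings = pd.Series([4.0, 4.5, np.nan, 4.0, 5.0, np.nan], name="play_star_rating")

total_rows = len(ratings)          # every entry, missing or not
non_missing = ratings.count()      # entries that are not NaN
distinct = ratings.nunique()       # different non-missing values
missing = ratings.isna().sum()     # entries that are NaN

print(total_rows, non_missing, distinct, missing)
```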

Now, compute the standard deviation of the `list_price` column.

In [14]:
df.list_price.std()

91.9804293059243

If we bought every single LEGO set in this dataset, how many pieces would we have?  Use the `.sum()` method on the correct column to compute this. 

In [4]:
df.piece_count.sum()

6049650.0

In [21]:
df.groupby('prod_id').piece_count.mean().sum().round()

319071.0

In [19]:
df2 = df.groupby('prod_id').piece_count.unique()
df2.head()


prod_id
630.0      [1.0]
2304.0     [1.0]
7280.0     [2.0]
7281.0     [2.0]
7499.0    [24.0]
Name: piece_count, dtype: object

In [22]:
# Take the first piece count recorded for each product ID and add them up
df2 = df.groupby('prod_id').piece_count.unique()
print(sum(x[0] for x in df2))

319071.0
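
The gap between the two totals above suggests that the same `prod_id` appears in more than one row. The sketch below illustrates that situation under the assumption that each product is listed once per country; the toy rows are invented for the example and `drop_duplicates` is just one possible way to count each product once.

```python
import pandas as pd

# Invented rows: two of the three products are listed in two countries.
toy = pd.DataFrame({
    "prod_id": [75823.0, 75823.0, 75822.0, 21030.0, 21030.0],
    "country": ["US", "CA", "US", "US", "CA"],
    "piece_count": [277.0, 277.0, 168.0, 1032.0, 1032.0],
})

# Summing every row counts repeated products more than once.
total_all_rows = toy["piece_count"].sum()

# Keep one row per product before summing.
unique_sets = toy.drop_duplicates(subset="prod_id")
total_unique = unique_sets["piece_count"].sum()

# Equivalent result using groupby: take the first piece count per product.
total_groupby = toy.groupby("prod_id")["piece_count"].first().sum()

print(total_all_rows, total_unique, total_groupby)
```

Which total is the right answer depends on how you read the question: every row in the dataset, or every distinct product.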



Now, let's try getting the value for the 90% quantile.  Do this in the cell below.

In [33]:
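# numeric_only=True skips the text columns, which newer pandas versions would otherwise reject
df.quantile(q=0.9, numeric_only=True)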

list_price            136.2971
num_reviews            38.0000
piece_count          1077.0000
play_star_rating        5.0000
prod_id             75531.0000
star_rating             5.0000
val_star_rating         5.0000
Name: 0.9, dtype: float64
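
To see what a quantile means on a small scale, the sketch below applies `.quantile()` to the ten piece counts shown near the start of this lab. With pandas' default linear interpolation, the 90% quantile falls between the two largest values.

```python
import pandas as pd

# The ten piece counts displayed earlier in the lab.
piece_count = pd.Series(
    [277.0, 168.0, 74.0, 1032.0, 744.0, 597.0, 598.0, 780.0, 468.0, 444.0],
    name="piece_count",
)

# A single quantile returns a scalar.
q90 = piece_count.quantile(0.9)
print(q90)

# Passing a list returns a Series indexed by the requested quantiles.
quartiles = piece_count.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

The 0.5 quantile is the median, so `piece_count.quantile(0.5)` and `piece_count.median()` agree.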

## Getting Summary Statistics on Categorical Data

Most of the methods we've used so far only work with numerical data: there's no meaningful way to calculate the standard deviation of a column containing string values. However, there are still some things that we can discover about columns containing categorical data. 

In the cell below, get the `.unique()` values contained within the `review_difficulty` column. 

In [34]:
df.review_difficulty.unique()

array(['Average', 'Easy', 'Challenging', 'Very Easy', nan,
       'Very Challenging'], dtype=object)

Now, let's get the `value_counts` for this column, to see how common each is. 

In [35]:
df.review_difficulty.value_counts()

Easy                4236
Average             3765
Very Easy           1139
Challenging         1058
Very Challenging       8
Name: review_difficulty, dtype: int64
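
Note that the counts above leave out the missing values. As a small sketch on made-up data (the Series below is an assumption for illustration), `value_counts()` also accepts `normalize=True` to return proportions and `dropna=False` to include missing values in the tally.

```python
import numpy as np
import pandas as pd

# Invented difficulty labels with one missing entry.
difficulty = pd.Series(["Average", "Easy", "Easy", np.nan, "Challenging", "Easy"])

# Default: counts of each label, missing values excluded.
counts = difficulty.value_counts()
print(counts)

# Proportions of the non-missing entries instead of raw counts.
shares = difficulty.value_counts(normalize=True)
print(shares)

# Include missing values as their own category.
with_missing = difficulty.value_counts(dropna=False)
print(with_missing)
```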

As you can see, these provide us quick and easy ways to get information on columns containing categorical information.  


## Using `.applymap()`

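When working with pandas DataFrames, we can apply a function to every element by calling the `applymap()` method and passing in a function, such as a lambda. Note that pandas 2.1 deprecated `applymap()` in favor of the equivalent `DataFrame.map()`, so newer versions may show a warning.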

For instance, we can use `applymap()` to return a version of the DataFrame where every value has been converted to a string.

In the cell below:

* Call our DataFrame's `.applymap()` function and pass in `lambda x: str(x)`
* Call our new `string_df` object's `.info()` method to confirm that everything has been cast to a string

In [40]:
string_df = df.applymap(lambda x: str(x))

In [41]:
string_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages                 12261 non-null object
list_price           12261 non-null object
num_reviews          12261 non-null object
piece_count          12261 non-null object
play_star_rating     12261 non-null object
prod_desc            12261 non-null object
prod_id              12261 non-null object
prod_long_desc       12261 non-null object
review_difficulty    12261 non-null object
set_name             12261 non-null object
star_rating          12261 non-null object
theme_name           12261 non-null object
val_star_rating      12261 non-null object
country              12261 non-null object
dtypes: object(14)
memory usage: 1.3+ MB


Note that everything, even the `NaN` values, has been cast to a string in the example above. 
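
A practical consequence is that missing values stop looking missing once they are strings. The sketch below shows this on a made-up two-row DataFrame (the `toy` frame is an assumption for illustration). Because `applymap()` is deprecated from pandas 2.1 onward, the example picks `DataFrame.map()` when it is available and falls back to `applymap()` otherwise.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing value.
toy = pd.DataFrame({
    "num_reviews": [2.0, np.nan],
    "set_name": ["Piggy Plane Attack", "Bird Island Egg Heist"],
})

# Use DataFrame.map on newer pandas, applymap on older versions.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
string_toy = elementwise(lambda x: str(x))

# The missing value is now the literal text "nan".
print(string_toy.loc[1, "num_reviews"])

# So isna() no longer detects it.
missing_before = int(toy.isna().sum().sum())
missing_after = int(string_toy.isna().sum().sum())
print(missing_before, missing_after)
```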

Note that for pandas Series objects (such as a single column in a DataFrame), we can do the same thing using the `apply()` method.  
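
A minimal sketch of `apply()` on a single column follows. The prices are made up, and the `price_band` labels are a hypothetical custom function added only to show that any function of one value can be used.

```python
import pandas as pd

# Invented prices standing in for one column of the DataFrame.
list_price = pd.Series([29.99, 19.99, 12.99], name="list_price")

# Same conversion as above, applied to one column.
as_text = list_price.apply(lambda x: str(x))
print(as_text.tolist())

# Any single-value function works, for example a simple price label.
price_band = list_price.apply(lambda p: "under 20" if p < 20 else "20 or more")
print(price_band.tolist())
```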

This is just one example of how we can quickly apply custom functions to our DataFrame. This will become especially useful when we learn how to **_normalize_** our datasets in a later section!

## Summary

In this lab, we learned how to:

* Understand and use the `df.describe()` and `df.info()` summary statistics methods
* Use built-in Pandas methods for calculating summary statistics (`.mean()`, `.median()`, `.std()`, `.var()`, `.count()`, `.sum()` and `.quantile()`)
* Apply a function to every element in a Series or DataFrame using `s.apply()` and `df.applymap()`