1. Introduction to Scales
======

At its core, statistics is about counting and measuring. In order to do both effectively, we have to define scales on which to base our counts. A scale represents the possible values that a variable can have.

**Equal Interval Scales**

Equal interval scales are always consistent. Think of the speed of a car. No matter what speed you're traveling at, a difference of five miles per hour is always five miles per hour. The difference between 60 and 55 miles per hour will always be equal to the difference between 10 and five miles per hour.

**Logarithmic Scales**

Each step on a logarithmic scale represents a different order of magnitude. The Richter scale that measures the strength of earthquakes, for example, is a logarithmic scale. The difference between a 5 and a 6 on the Richter scale is more than the difference between a 4 and 5. This is because each number on the Richter scale represents 10 times the shaking amplitude of the previous number. A 6 on the Richter scale is 10 times more powerful (technically, powerful is the wrong term, but it makes thinking about this easier) than a 5, which is 10 times more powerful than a 4. A 6 is 100 times more powerful than a 4.
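As a rough illustration of the factor-of-ten rule, the ratio between two magnitudes can be computed as 10 raised to their difference. The helper name `amplitude_ratio` is hypothetical, and the sketch assumes the simplified "10 times per step" model described above.

```python
def amplitude_ratio(magnitude_a, magnitude_b):
    """Return how many times larger the shaking amplitude at magnitude_a is
    than at magnitude_b, assuming each whole step is a factor of 10."""
    return 10 ** (magnitude_a - magnitude_b)


print(amplitude_ratio(6, 5))  # 10
print(amplitude_ratio(6, 4))  # 100
```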

**Look out!!!**

We can calculate the mean of the values on an equal interval scale by adding those values, and then dividing by the total number of values. We could do the same for the values on a non-equal interval scale, but the results wouldn't be meaningful, due to the differences between units.
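To see why the plain mean misleads on a logarithmic scale, the sketch below compares it with a mean taken in amplitude units and then mapped back to the scale. The magnitudes are made up for illustration, and the conversion assumes the simplified "10 times per step" model.

```python
import math

# Hypothetical magnitudes, chosen for illustration only.
magnitudes = [4, 6]

# Naive mean of the scale values.
naive_mean = sum(magnitudes) / len(magnitudes)

# Mean taken in amplitude units (assuming 10x per step), then mapped back.
amplitudes = [10 ** m for m in magnitudes]
mean_amplitude = sum(amplitudes) / len(amplitudes)
magnitude_of_mean_amplitude = math.log10(mean_amplitude)

print(naive_mean)                             # 5.0
print(round(magnitude_of_mean_amplitude, 2))  # 5.7
```

The two answers differ because the larger event dominates the average once the values are expressed in equal-interval units.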


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

>```python
car_speeds = [10,20,30,50,20]
earthquake_intensities = [2,7,4,5,8]
```

1. Compute the mean of <span style="background-color: #F9EBEA; color:#C0392B">car_speeds</span>, and assign the result to <span style="background-color: #F9EBEA; color:#C0392B">mean_car_speed</span>.
2. Compute the mean of <span style="background-color: #F9EBEA; color:#C0392B">earthquake_intensities</span>, and assign the result to <span style="background-color: #F9EBEA; color:#C0392B">mean_earthquake_intensities</span>. Note that this value will not be meaningful, because we shouldn't average values on a logarithmic scale this way.

2. Discrete and Continuous Scales
====

Scales can be either **discrete** or **continuous**.

Think of someone marking down the number of inches a snail crawls every day. The snail could crawl 1 inch, 2 inches, 1.5 inches, 1.51 inches, or any other number, and it would be a valid observation. This is because inches are on a continuous scale, and even fractions of an inch are possible.

Now think of someone counting the number of cars in a parking lot each day. 1 car, 2 cars, and 10 cars are valid measurements, but 1.5 cars isn't valid.

Half of a car isn't a meaningful quantity, because cars are discrete. You can't have 52% of a car - you either have a car, or you don't.

You can still average items on discrete scales, though. You could say <span style="background-color: #F9EBEA; color:#C0392B">"1.75 cars use this parking lot each day, on average."</span> Any daily value for number of cars, however, would need to be a whole number.
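A minimal sketch of that point, using made-up daily counts: every observation is a whole number, yet the average is not.

```python
# Hypothetical daily counts; each observation is a whole number of cars.
daily_car_counts = [1, 2, 2, 2]

average_cars = sum(daily_car_counts) / len(daily_car_counts)

print(all(isinstance(count, int) for count in daily_car_counts))  # True
print(average_cars)  # 1.75
```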


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

>```python
day_numbers = [1,2,3,4,5,6,7]
snail_crawl_length = [.5,2,5,10,1,.25,4]
cars_in_parking_lot = [5,6,4,2,1,7,8]
```


1. Make a line plot with <span style="background-color: #F9EBEA; color:#C0392B">day_numbers</span> on the x-axis and <span style="background-color: #F9EBEA; color:#C0392B">snail_crawl_length</span> on the y-axis.
2. Make a line plot with <span style="background-color: #F9EBEA; color:#C0392B">day_numbers</span> on the x-axis and <span style="background-color: #F9EBEA; color:#C0392B">cars_in_parking_lot</span> on the y-axis.

3. Understanding Scale Starting Points
===

Some scales use the zero value in different ways. Think of the number of cars in a parking lot. Zero cars in the lot means that there are absolutely no cars at all, so absolute zero is at 0 cars. You can't have negative cars.

Now, think of degrees Fahrenheit. Zero degrees doesn't mean that there isn't any warmth; the degree scale can also be negative, and absolute zero (when there is no warmth at all) is at -459.67 degrees.

Ratios are only meaningful on scales whose absolute zero is at 0. The parking lot is such a scale: if four cars parked in the lot yesterday and eight park today, I can safely say that twice as many cars are in the lot today.

However, if it was 32 degrees Fahrenheit yesterday, and it's 64 degrees today, I can't say that it's twice as warm today as yesterday.
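The comparison can be sketched in code by shifting each reading so that absolute zero sits at 0 before dividing. The helper name `ratio_from_absolute_zero` is hypothetical; the only outside value used is the -459.67 figure quoted above.

```python
ABSOLUTE_ZERO_F = -459.67  # absolute zero in degrees Fahrenheit, as stated above


def ratio_from_absolute_zero(value_a, value_b, zero_point=0.0):
    """Ratio of two readings after shifting the scale so zero_point maps to 0."""
    return (value_a - zero_point) / (value_b - zero_point)


# Cars: absolute zero is already at 0, so the plain ratio is meaningful.
print(ratio_from_absolute_zero(8, 4))  # 2.0

# Fahrenheit: shift by absolute zero before comparing.
temperature_ratio = ratio_from_absolute_zero(64, 32, zero_point=ABSOLUTE_ZERO_F)
print(round(temperature_ratio, 3))  # 1.065
```

Measured from absolute zero, 64 degrees is only about 6.5% warmer than 32 degrees, not twice as warm.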


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

>```python
fahrenheit_degrees = [32, 64, 78, 102]
yearly_town_population = [100,102,103,110,105,120]
```
Tip: use a list comprehension.


1. Convert the values in <span style="background-color: #F9EBEA; color:#C0392B">fahrenheit_degrees</span> so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to <span style="background-color: #F9EBEA; color:#C0392B">degrees_zero</span>.
2. Convert the values in <span style="background-color: #F9EBEA; color:#C0392B">yearly_town_population</span> so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to <span style="background-color: #F9EBEA; color:#C0392B">population_zero</span>.

4. Working With Ordinal Scales
===

So far, we've looked at equal interval and discrete scales, where all of the values are numbers. We can also have ordinal scales, where items are ordered by rank. For example, we could ask people how many cigarettes they smoke per day, and the answers could be "none," "a few," "some," or "a lot." These answers don't map exactly to numbers of cigarettes, but we know that "a few" is more than "none."

This is an ordinal rating scale. We can assign numbers to the answers in a logical order to make them easier to work with. For example, we could map 0 to "none," 1 to "a few," 2 to "some," and so on.
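One way to sketch this mapping is a dictionary built from the ordered list of labels, so that each label's position becomes its rank. The `answers` list is made up for illustration.

```python
# Ordered labels from the example above; position in the list is the rank.
scale = ["none", "a few", "some", "a lot"]
rank_of = {label: rank for rank, label in enumerate(scale)}

# Hypothetical answers, for illustration only.
answers = ["a few", "a lot", "none"]
ranks = [rank_of[answer] for answer in answers]

print(rank_of)
print(ranks)  # [1, 3, 0]
```

The ranks preserve order only: the gap between 1 and 2 need not represent the same number of cigarettes as the gap between 2 and 3.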


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

>```python
# Results from our survey on how many cigarettes people smoke per day
survey_responses = ["none", "some", "a lot", "none", "a few", "none", "none"]
survey_scale = ["none", "a few", "some", "a lot"]
```
Tip: use a list comprehension.


1. In the following code block, assign a number to each survey response that corresponds with its position on the scale (<span style="background-color: #F9EBEA; color:#C0392B">"none"</span> is <span style="background-color: #F9EBEA; color:#C0392B">0</span>, and so on).
2. Compute the average value of all the survey responses, and assign it to <span style="background-color: #F9EBEA; color:#C0392B">average_smoking</span>.

5. Grouping Values with Categorical Scales
===

We can also have categorical scales, which group values into general categories. One example is gender, which can be male or female. Unlike ordinal scales, categorical scales don't have an order. In our gender example, for instance, one category isn't greater than or less than the other.

Categories are common in data science. You'll typically use them to split data into groups.
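Splitting data by category can be sketched with a dictionary of lists, one list per category label. The data below is hypothetical and unrelated to the exercise that follows.

```python
from collections import defaultdict

# Hypothetical data: one category label and one numeric value per observation.
vehicle_type = ["car", "truck", "car", "truck", "car"]
parking_minutes = [30, 90, 45, 60, 15]

# Collect the values that belong to each category.
groups = defaultdict(list)
for label, minutes in zip(vehicle_type, parking_minutes):
    groups[label].append(minutes)

# Summarize each group separately.
group_means = {label: sum(values) / len(values) for label, values in groups.items()}
print(group_means)  # {'car': 30.0, 'truck': 75.0}
```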


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

>```python
gender = ["male", "female", "female", "male", "male", "female"]
savings = [1200, 5000, 3400, 2400, 2800, 4100]
```
Tip: use a list comprehension.

1. Compute the average savings for everyone who is <span style="background-color: #F9EBEA; color:#C0392B">"male"</span>. Assign the result to <span style="background-color: #F9EBEA; color:#C0392B">male_savings</span>.
2. Compute the average savings for everyone who is <span style="background-color: #F9EBEA; color:#C0392B">"female"</span>. Assign the result to <span style="background-color: #F9EBEA; color:#C0392B">female_savings</span>.



6. Visualizing Counts with Frequency Histograms
===

Remember how statistics is all about counting? A **frequency histogram** is a type of plot that helps us visualize counts of data. These plots tally how many times each value occurs in a list, then graph the values on the x-axis and the counts on the y-axis. **Frequency histograms** give us a better understanding of where values fall within a data set.
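The tallying step can be sketched with `collections.Counter` from the standard library. The values are made up, and the loop prints a text version of the histogram rather than a real plot.

```python
from collections import Counter

# Hypothetical observations, for illustration only.
values = [2, 3, 3, 5, 3, 2, 5, 3]

# Tally how many times each value occurs.
counts = Counter(values)

# Text histogram: the value on the left, one '#' per occurrence.
for value in sorted(counts):
    print(value, "#" * counts[value])
```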



In [None]:
import pandas as pd

students = pd.read_csv('ead.csv')
students.head()

In [None]:
# seaborn is commonly imported as `sns`.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#to switch to seaborn defaults, simply call the set() function.
sns.set()

plt.figure(figsize=(10, 8))


# The four preset contexts, in order of relative size, are paper, notebook, talk, and poster
sns.set_context("talk")

# plot the count of each category (a frequency histogram for a categorical variable)
sns.countplot(x='country', data=students)

# rotate the x-axis tick labels
plt.xticks(rotation=90)

plt.show()

7. Measuring Data Skew
===

Now that you know how to make histograms, did you notice how the plots have "shapes?"

These shapes are important because they can show us the distributional characteristics of the data. The first characteristic we'll look at is <span style="background-color: #F9EBEA; color:#C0392B">skew</span>.

Skew refers to asymmetry in the data. When data is concentrated on the right side of the histogram, with a longer tail stretching to the left, we say it has a <span style="background-color: #F9EBEA; color:#C0392B">negative skew</span>. When the data is concentrated on the left, with a longer tail stretching to the right, we say it has a <span style="background-color: #F9EBEA; color:#C0392B">positive skew</span>.

We can measure the level of skew with the <span style="background-color: #F9EBEA; color:#C0392B">skew</span> function from <span style="background-color: #F9EBEA; color:#C0392B">scipy.stats</span>. A positive value indicates a positive skew, a negative value indicates a negative skew, and a value close to zero indicates no skew.

In [None]:
# We can test how skewed a distribution is using the skew function.
# A positive value means positive skew, 
# a negative value means negative skew, and close to zero means no skew.
from scipy.stats import skew

skewness = skew(students['country'].value_counts())
skewness
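The sign convention can be checked on small hand-made samples. The sketch below is a plain-Python version of the moment-based skewness (third central moment divided by the second central moment raised to 1.5), which to my understanding is what `scipy.stats.skew` computes with its default settings. The function name `sample_skewness` and the samples are hypothetical.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples, for illustration only.
long_right_tail = [1, 1, 2, 2, 3, 10]           # bulk on the left, tail to the right
long_left_tail = [-x for x in long_right_tail]  # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(long_right_tail) > 0)  # True: positive skew
print(sample_skewness(long_left_tail) < 0)   # True: negative skew
print(sample_skewness(symmetric))            # 0.0
```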

8. Checking for Outliers with Kurtosis
==

In probability theory and statistics, kurtosis (from Greek: κυρτός, kyrtos or kurtos, meaning "curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis, and of how particular measures should be interpreted.

The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value. Distributions with kurtosis less than 3 are said to be platykurtic, although this does not imply the distribution is "flat-topped" as sometimes reported. Rather, it means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers. Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution. It is also common practice to use an adjusted version of Pearson's kurtosis, the excess kurtosis, which is the kurtosis minus 3, to provide the comparison to the normal distribution. Some authors use "kurtosis" by itself to refer to the excess kurtosis. 
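The "kurtosis minus 3" adjustment can be sketched in plain Python as the fourth central moment divided by the squared second central moment, minus 3. The function name `excess_kurtosis` and both samples are hypothetical, and this is the simple moment-based estimator with no small-sample bias correction.

```python
def excess_kurtosis(values):
    """Fourth central moment / (second central moment ** 2), minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical samples, for illustration only.
two_point = [-1, 1]            # no tails at all
one_outlier = [0] * 10 + [10]  # ten identical values and one extreme value

print(excess_kurtosis(two_point))        # -2.0 (platykurtic)
print(excess_kurtosis(one_outlier) > 0)  # True (leptokurtic)
```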


In [None]:
# We can measure kurtosis with the kurtosis function.
# By default (fisher=True) it returns the excess kurtosis (kurtosis minus 3), so
# negative values indicate platykurtic distributions, positive values indicate
# leptokurtic distributions, and values near 0 are mesokurtic.

# platykurtic (< 0) = produces fewer and less extreme outliers than does the normal distribution
# leptokurtic (> 0) = produces more outliers than the normal distribution

from scipy.stats import kurtosis

kurtosiness = kurtosis(students['country'].value_counts())
kurtosiness


In [None]:
import numpy as np

# size of sample
N = 1000

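# uniform sample between 1 and 100 (wrapping in a list gives shape (1, N))
uniform_sample = np.array([
    np.random.uniform(low=1, high=100, size=N)
])

# normal sample with mean 2 and standard deviation 0.5 (shape (1, N))
normal_sample = np.array([
    np.random.normal(2, 0.5, N)
])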

# intermediate sample
meso_sample = np.array([60, 65, 63, 67, 68, 68, 69, 70, 71, 72, 77])


# sample with outliers
outlier_sample = np.array([60, 60, 61, 62, 63, 61, 60, 60, 61, 602, 63])

In [None]:
fig, ax = plt.subplots(figsize=(10,5), ncols=2, nrows=2)
# fig, (ax1, ax2, ax3) = plt.subplots(figsize=(10,5), ncols=3, nrows=1)

# organize space among figures
fig.tight_layout()

# main title
plt.suptitle("Checking for Outliers with Kurtosis", 
             fontsize=20,
            y = 1.09)

# title margin for each figure
y_title_margin = 1

# flatten the samples: the uniform and normal arrays have shape (1, N)
samples = {
    "Normal Sample": np.ravel(normal_sample),
    "Uniform Sample": np.ravel(uniform_sample),
    "Meso Sample": np.ravel(meso_sample),
    "Outlier Sample": np.ravel(outlier_sample),
}

print('Excess kurtosis values')
for axis, (name, sample) in zip(ax.ravel(), samples.items()):
    # excess kurtosis (scipy default, fisher=True): close to 0 for normal data
    k = kurtosis(sample)
    # put the computed value in the title instead of a hard-coded number
    axis.set_title(f"{name} (kurtosis = {k:.2f})", y=y_title_margin, fontsize=12)
    # histplot replaces the deprecated distplot
    sns.histplot(sample, kde=False, ax=axis)
    print(f'{name}: {k}')

plt.show()

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Import the three sheets from the <span style="background-color: #F9EBEA; color:#C0392B">FMC_I.xlsx</span> file.
2. Analyze the <span style="background-color: #F9EBEA; color:#C0392B">skew</span> and <span style="background-color: #F9EBEA; color:#C0392B">kurtosis</span> properties for each sheet using <span style="background-color: #F9EBEA; color:#C0392B">scipy.stats</span> and <span style="background-color: #F9EBEA; color:#C0392B">seaborn</span>.
3. Use <span style="background-color: #F9EBEA; color:#C0392B">matplotlib.axes.Axes.axvline</span> to draw the mean and median values on the histogram of each FMC class.


In [None]:
# Import pandas
import pandas as pd

# source
arquivo = 'FMC_I.xlsx'

# open the Excel file; each sheet is parsed into a DataFrame below
excel = pd.ExcelFile(arquivo)

# print sheet names
print(excel.sheet_names)

# FMC_I classes in 2017.1, one sheet per class
T34 = excel.parse(0)
N12 = excel.parse(1)
M56 = excel.parse(2)

In [None]:
# keep only rows whose status (Estado) is not cancelled (CANCELADO)
T34 = T34[T34.Estado != 'CANCELADO']
T34.head()

In [None]:
N12 = N12[N12.Estado != 'CANCELADO']
N12.head()

In [None]:
M56 = M56[M56.Estado != 'CANCELADO']
M56.head()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(10,5), ncols=3, nrows=1)

# organize space among figures
fig.tight_layout()

# main title
plt.suptitle("Checking for Outliers with Kurtosis", 
             fontsize=20,
            y = 1.09)

# title margin for each figure
y_title_margin = 1

### Titles of subplots
ax1.set_title("N12 (kurtosis = -1.64)", y = y_title_margin, fontsize=12)
ax2.set_title("T34 (kurtosis = -1.48)",y = y_title_margin, fontsize=12)
ax3.set_title("M56 (kurtosis = 1.50)",y = y_title_margin, fontsize=12)


# histplot replaces the deprecated distplot
sns.histplot(N12['Média'], kde=False, ax=ax1, bins=20)
sns.histplot(T34['Média'], kde=False, ax=ax2, bins=20)
sns.histplot(M56['Média'], kde=False, ax=ax3, bins=20)

print('Kurtosis values',
      '\nN12: ', kurtosis(N12['Média']),
      '\nT34: ', kurtosis(T34['Média']),
      '\nM56: ', kurtosis(M56['Média']))
      

plt.show()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(10,5), ncols=3, nrows=1)

# organize space among figures
fig.tight_layout()

# main title
plt.suptitle("Skew Analysis", 
             fontsize=20,
            y = 1.09)

# title margin for each figure
y_title_margin = 1

### Titles of subplots
ax1.set_title("N12 (skew = -0.11)", y = y_title_margin, fontsize=12)
ax2.set_title("T34 (skew = -0.06)",y = y_title_margin, fontsize=12)
ax3.set_title("M56 (skew = 1.70)",y = y_title_margin, fontsize=12)

# histplot replaces the deprecated distplot
sns.histplot(N12['Média'], kde=False, ax=ax1, bins=20)
sns.histplot(T34['Média'], kde=False, ax=ax2, bins=20)
sns.histplot(M56['Média'], kde=False, ax=ax3, bins=20)

print('Skew values',
      '\nN12: ', skew(N12['Média']),
      '\nT34: ', skew(T34['Média']),
      '\nM56: ', skew(M56['Média']))
      

    
    
plt.show()

In [None]:
M56['Média'].mode()

In [None]:
print(M56['Média'].mean())
print(M56['Média'].mode())
print(M56['Média'].median())

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(10,5), ncols=3, nrows=1)

# organize space among figures
fig.tight_layout()

# main title
plt.suptitle("Mean (red) vs Median (green)", 
             fontsize=20,
            y = 1.09)

# title margin for each figure
y_title_margin = 1

# continue in the same cell so the inline backend renders the finished figure
### Titles of subplots
ax1.set_title("N12 ", y = y_title_margin, fontsize=12)
ax2.set_title("T34 ",y = y_title_margin, fontsize=12)
ax3.set_title("M56 ",y = y_title_margin, fontsize=12)

# Plot the mean in red
ax1.axvline(N12['Média'].mean(), color="r")
ax2.axvline(T34['Média'].mean(), color="r")
ax3.axvline(M56['Média'].mean(), color="r")

# Plot the median in green
ax1.axvline(N12['Média'].median(), color="g")
ax2.axvline(T34['Média'].median(), color="g")
ax3.axvline(M56['Média'].median(), color="g")


# histplot replaces the deprecated distplot
sns.histplot(N12['Média'], kde=False, ax=ax1, bins=20)
sns.histplot(T34['Média'], kde=False, ax=ax2, bins=20)
sns.histplot(M56['Média'], kde=False, ax=ax3, bins=20)

print('Skew values',
      '\nN12: ', skew(N12['Média']),
      '\nT34: ', skew(T34['Média']),
      '\nM56: ', skew(M56['Média']))
      

    
    
plt.show()