# DS105 Intermediate Statistics : Lesson Two Normality and Transformations
## Data Transformations

### Table of Contents <a class="anchor" id="DS105L2_toc"></a>

* [Table of Contents](#DS105L2_toc)
    * [Page 1 - Introduction](#DS105L2_page_1)
    * [Page 2 - Normality](#DS105L2_page_2)
    * [Page 3 - Data Transformations](#DS105L2_page_3)
    * [Page 4 - Transformations in R](#DS105L2_page_4)
    * [Page 5 - Transformations in R Activity](#DS105L2_page_5)
    * [Page 6 - Transformations in R Activity Solution](#DS105L2_page_6)
    * [Page 7 - Transformations in Python](#DS105L2_page_7)
    * [Page 8 - Transformations in Python Activity](#DS105L2_page_8)
    * [Page 9 - Transformations in Python Activity Solution](#DS105L2_page_9)
    * [Page 10 - Key Terms](#DS105L2_page_10)
    * [Page 11 - Lesson 2 Hands-On](#DS105L2_page_11)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS105L2_page_1"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Basic Statistics in Python
VimeoVideo('388868344', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L02overview.zip)**.

# Introduction

Up to this point, you have learned that normality is an assumption for many statistical tests, and it will be for many more you still have to learn.  However, you've never learned what to do when the data isn't normal - you've just soldiered on.  You'll now learn how to transform your data so that it becomes normal.  


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Normality<a class="anchor" id="DS105L2_page_2"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Skew and Kurtosis
VimeoVideo('335019896', width=720, height=480)
```

```c-lms
topic: Normality
video-id: Skew and Kurtosis
video-url-mp4: https://player.vimeo.com/external/335019896.hd.mp4?s=53826cc2933dfccf074c99fa43fade47a7969aff&profile_id=175
video-url-mp4-1080: https://player.vimeo.com/external/335019896.hd.mp4?s=53826cc2933dfccf074c99fa43fade47a7969aff&profile_id=175
video-url-mp4-720: https://player.vimeo.com/external/335019896.hd.mp4?s=53826cc2933dfccf074c99fa43fade47a7969aff&profile_id=174
video-url-mp4-540: https://player.vimeo.com/external/335019896.sd.mp4?s=ecba9491cd7d721398956d63208914a8eb73e990&profile_id=165
video-url-mp4-360: https://player.vimeo.com/external/335019896.sd.mp4?s=ecba9491cd7d721398956d63208914a8eb73e990&profile_id=164
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L02pg2tutorial.zip)**.

# Normal Distribution Review

As a reminder, the normal distribution is also called the bell curve because it takes on a lovely bell shape.  For instance, look at this one down below: 

![A curve is plotted on a graph. Three vertical lines are drawn from the curve to meet the x-axis at three points labeled, mu minus sigma, mu, mu plus sigma.](Media/skew1.png)

But data comes in many more shapes than that, even if the normal distribution does come up more than you'd think. 

---
# Skew

When the bulk of the data sits to one side and a tail stretches out to the other, the distribution has *skew*.  A *negatively skewed distribution* has a tail of data on the left, and the bulk of the data on the right.  A *positively skewed distribution* has the tail of data on the right, and the bulk of the data on the left.  

![Two graphs labeled negative skew and positive skew. A dashed line meets the x-axis from the peak point.](Media/skew2.png)

Because the data is no longer symmetric, the average is no longer in the center either. In a normal distribution, the mean, median, and mode are pretty much the same.  But if it's negatively skewed, the tail pulls the mean below the median and the mode, and if it's positively skewed, the tail pulls the mean above the median and the mode. 
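To see the mean get pulled toward the tail, here is a minimal Python sketch. The data is made up (draws from an exponential distribution, which is positively skewed), and the variable names are only for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative data only: 5,000 draws from a right-skewed (exponential) distribution
rng = np.random.default_rng(seed=0)
sample = rng.exponential(scale=2.0, size=5000)

sample_mean = sample.mean()
sample_median = np.median(sample)
sample_skew = stats.skew(sample)  # positive for a right tail, negative for a left tail

print(f"mean={sample_mean:.2f}, median={sample_median:.2f}, skew={sample_skew:.2f}")
```

With a right tail, the printed mean should land above the median, and the skewness statistic should be positive.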

![A graph depicts the plot of random variable against the value of function. Three curves are plotted and they are labeled left skew mean is less than mode, normal mean equals mode, and right skew mean exceeds mode.](Media/skew4.png)

---
# Kurtosis

When the bell curve is much taller or flatter than the normal distribution, it is said to have *kurtosis*.  A *positively kurtotic distribution* is tall and sharply peaked, and usually has heavier tails than the normal curve.  The technical term for this is *leptokurtic*. A *negatively kurtotic distribution* is flat and broad, and in extreme cases may even dip in the middle. Another word for negatively kurtotic is *platykurtic*.   

![A deep inverted curve labeled positive kurtosis leptokurtic and an almost-flat curve labeled negative kurtosis platykurtic.](Media/skew3.png)
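If you would like a number to go with the picture, the sketch below compares three made-up samples in Python. It assumes ```scipy``` is available; ```stats.kurtosis()``` reports *excess* kurtosis by default, so a normal sample comes out near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
flat = rng.uniform(-1, 1, size=5000)    # flatter than normal, thin tails
bell = rng.normal(0, 1, size=5000)      # the reference bell curve
peaked = rng.laplace(0, 1, size=5000)   # sharper peak, heavier tails

# Excess kurtosis: negative is platykurtic, near zero is mesokurtic, positive is leptokurtic
kurt_flat = stats.kurtosis(flat)
kurt_bell = stats.kurtosis(bell)
kurt_peaked = stats.kurtosis(peaked)

print(f"flat={kurt_flat:.2f}, bell={kurt_bell:.2f}, peaked={kurt_peaked:.2f}")
```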

---

# Overview

This info is summarized nicely in the figure below: 

![A table has six graphs in two columns and three rows. The column headings are labeled skewness and kurtosis. The graphs on the skewness column are labeled plus, zero, and minus. The graphs on the kurtosis column are labeled platykurtic, mesokurtic, and leptokurtic.](Media/skew5.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Data Transformations<a class="anchor" id="DS105L2_page_3"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Basic Statistics in Python
VimeoVideo('335019817', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L02pg3tutorial.zip)**.

# Transformations

Both skew and kurtosis affect many statistical analyses and make them inaccurate, so you need to correct for them if normality is an assumption.  Correcting for skew and kurtosis is called *transformation*.   

---

## Pros and Cons to Transforming Your Data

Pros: 

* Makes your distribution closer to normal
* Helps fix issues with heteroscedasticity (uneven variance) and non-linearity
* Improves the accuracy of the results

There is only one major con.  In transforming your data, you change the scale, which means that interpretation, especially for regressions, can be much more difficult. 

---

## Common Transformations

When you transform data, you change the *power*, meaning the exponent that each value is raised to.  Power is basically: 

```text
1 - slope
```

The table below shows the most common data transformations and how to do them. 

<table class="table table-striped">
    <tr>
        <th>Power</th>
        <th>Transformation</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>3</td>
        <td>Cube</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>2</td>
        <td>Square</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>1</td>
        <td>No change</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>1/2</td>
        <td>Square root</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>0</td>
        <td>Logarithm</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>-1/2</td>
        <td>Reciprocal square root</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>-1</td>
        <td>Reciprocal</td>
    </tr>
</table>
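The ladder in the table can be sketched as one small Python function. The name ```ladder_transform``` is made up for this illustration; the only special case is power 0, which stands for the logarithm rather than raising to the zeroth power.

```python
import numpy as np

def ladder_transform(values, power):
    """Apply one rung of the ladder of powers; power 0 means the logarithm."""
    values = np.asarray(values, dtype=float)
    if power == 0:
        return np.log(values)
    return values ** power

example = np.array([1.0, 4.0, 9.0, 16.0])   # made-up positive values
squared = ladder_transform(example, 2)      # power 2
rooted = ladder_transform(example, 0.5)     # power 1/2
logged = ladder_transform(example, 0)       # power 0
reciprocal = ladder_transform(example, -1)  # power -1
```

Note that the square root, logarithm, and reciprocal rungs only make sense for positive values.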

---

### Transforming Negatively Skewed Data

The baseline on the chart above is 1.  That would be your data left alone, with no transformation.  Go up from one to a power of two or three, and those transformations will help with negatively skewed data.  The higher the power, the stronger the transformation; so start small and work your way up.  
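As a rough check of that idea, the Python sketch below squares and cubes a made-up, negatively skewed sample and measures the skewness each time. The data is simulated, so treat it as an illustration rather than a guarantee for your own variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
left_skewed = rng.beta(5, 1, size=5000)  # bulk near 1, tail stretching toward 0

skew_original = stats.skew(left_skewed)
skew_squared = stats.skew(left_skewed ** 2)
skew_cubed = stats.skew(left_skewed ** 3)

print(f"original={skew_original:.2f}, squared={skew_squared:.2f}, cubed={skew_cubed:.2f}")
```

For this sample, each step up the ladder should move the skewness closer to zero.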

---

### Transforming Positively Skewed Data

Go down from one, and the next two transformations (power 1/2 and 0) are meant to help with positively skewed data. The smaller the power and the farther away from one, the stronger the transformation. 
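The same kind of check works in the other direction. This sketch uses a simulated, positively skewed sample (log-normal, chosen because its logarithm is exactly normal) and steps down the ladder.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # bulk near 0, long right tail

skew_original = stats.skew(right_skewed)
skew_sqrt = stats.skew(np.sqrt(right_skewed))  # power 1/2
skew_log = stats.skew(np.log(right_skewed))    # power 0

print(f"original={skew_original:.2f}, sqrt={skew_sqrt:.2f}, log={skew_log:.2f}")
```

Real data will rarely behave this neatly, which is why you try the gentler square root first and only move on to the log if you need to.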

---

### Transforming Data with the Reciprocal

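Lastly, there are two negative value powers, which have you take the reciprocal (one divided by the value).  For positive data, the reciprocal reverses the order of your values, so the largest becomes the smallest and the direction of the data is flipped.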
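Here is a tiny Python sketch of that flip, using four made-up positive values.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 10.0])  # made-up positive values
reciprocal = 1.0 / values                 # power -1

order_before = np.argsort(values)         # positions from smallest to largest
order_after = np.argsort(reciprocal)      # same, after taking the reciprocal

print(reciprocal)
print(order_before, order_after)
```

The value that was largest before the transformation is the smallest afterwards, which is something to keep in mind when you interpret results on the transformed scale.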

---

### Prescribing Transformations

Below is a figure that provides the original, non-normal data, and the most likely transformation to use.  For instance, if your data looks like the upper left picture, positively skewed but not very kurtotic, try a square root transformation.  

![A box labeled transformation on the top of six graphs. The graphs are labeled square root, reflect and square root, logarithm, reflect and logarithm, inverse, and reflect and inverse. The figure has a caption on the bottom that reads, original distributions and common transformations to produce normality.](Media/skew6.png)
*From Tabachnick, B.G. & Fidell, L.S. (2007) Using Multivariate Statistics (5th ed). Boston, MA: Pearson.*

Although this figure will help guide you, no data distribution is exactly the same, and you may need to try several different transformations before you get it right.  Choose the transformation that most closely approximates the normal distribution. It may not be perfect, but it should at least be better.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Always look at a histogram or QQ plot of your data after transformation to see if it did what you intended! </p>
    </div>
</div>
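If you want a quick numeric companion to the QQ plot, one option is the correlation between your sorted data and the normal quantiles, which ```scipy.stats.probplot()``` returns along with the plot points. The sketch below is only an illustration with simulated data; a value closer to 1 means the QQ points fall closer to a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=2000)  # made-up, positively skewed data
transformed = np.log(raw)

# probplot returns the QQ points plus a least-squares fit: (slope, intercept, r)
_, (_, _, r_raw) = stats.probplot(raw, dist="norm")
_, (_, _, r_transformed) = stats.probplot(transformed, dist="norm")

print(f"before={r_raw:.3f}, after={r_transformed:.3f}")
```

This does not replace looking at the plot, but it can help when you are comparing several candidate transformations.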

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Transformations in R<a class="anchor" id="DS105L2_page_4"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Basic Statistics in Python
VimeoVideo('335019940', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L02pg4tutorial.zip)**.

# Transformations in R

Now that you have a basic understanding of what transformations are, you'll do them in R.

---
## Visualizing Transformations 

You've already learned how to visually check for normality with ```ggplot```'s ```geom_histogram``` and ```geom_qq```.  But wouldn't it be awesome if a curve was printed on your histogram for you, so you could better assess whether it looked approximately normal? 

Well, good news! There is. It comes from the ```rcompanion``` library, and is the function ```plotNormalHistogram```.  

You'll use some of the datasets you used last lesson to inspect the data for normality.  How about taking a peek at the **[anime data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/anime.zip)** you were using last hands on? 

First, you'll need to install and import ```rcompanion```.

```{r}
# install.packages("rcompanion")  # run once if the package is not installed yet
library(rcompanion)
```

Then, you can use the ```plotNormalHistogram``` function on the ```score``` variable: 

```{r}
plotNormalHistogram(anime$score)
```

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 1940 to 2020 in 5 units and the y-axis ranges from 0 to 1500 in 4 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew7.png)

Oh good! This one looks to be approximately normally distributed.  No need to transform that. 

---

## Transforming Positively Skewed Data

How about ```scored_by```?

```{r}
plotNormalHistogram(anime$scored_by)
```

![A graph depicts the plot of x against the frequency. The value of x ranges from 0 e plus 00 to 1 e plus 06 in 6 units. A curve is plotted along the bars. The curve and the bars share almost equal peak points.](Media/skew8.png)

No such luck on that one.  It looks to be horribly positively skewed. 

---

### Using sqrt()

So start with a square root transformation, using the ```sqrt()``` function as an ordinary mathematical operation: 

```{r}
anime$scored_bySQRT <- sqrt(anime$scored_by)
```

This transforms the ```scored_by``` variable by taking its square root.  The result is stored in a new column in ```anime```, called ```scored_bySQRT```, so you can tell the old and the transformed data apart. 

Then run the histogram again: 

```{r}
plotNormalHistogram(anime$scored_bySQRT)
```

Which will produce an image that looks like this:

![sqrttransformation](Media/105sqrttransformation.png "sqrt transformation")

It looks better than before, but perhaps it can be better still.  

---

### Using log()

So make a larger transformation - you will now take the log of the ```scored_by``` variable, using the ```log()``` function: 

```{r}
anime$scored_byLOG <- log(anime$scored_by)
```

Then try to graph: 

```{r}
plotNormalHistogram(anime$scored_byLOG)
```

---

### Removing Infinite Values

You probably got an error here: 

```Error in seq.default(min(x), max(x), length = length) : 'from' must be a finite number```

This is because ```scored_by``` contains zeros, and the log of zero is negative infinity, which R stores as ```-Inf```.  Most functions in R will not work with infinite data, including this one. There is, however, a package and a line of code that can save the day. 

```{r}
library("IDPmisc")
anime2 <- NaRV.omit(anime)
```

The library ```IDPmisc``` contains a function called ```NaRV.omit()``` that omits rows with missing or infinite values.  Simply put your dataset name in as the argument.  Then graph again, with the new dataset name: 

```{r}
plotNormalHistogram(anime2$scored_byLOG)
```

And you are good to go!

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 2 to 14 in seven units and the y-axis ranges from 0 to 1000. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew9.png)

---

## Transforming Negatively Skewed Data

You will follow a similar process to transform negatively skewed data, but this time, you will use squaring or cubing to transform your data.  

The variable ```aired_from_year``` is negatively skewed: 

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 1940 to 2020 in 5 units and the y-axis ranges from 0 to 1500 in 4 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew10.png)

---

### Squaring the Variable

So you can start by squaring it: 

```{r}
anime$aired_from_yearSQ <- anime$aired_from_year * anime$aired_from_year
```

Then take a peek at it again: 

```{r}
plotNormalHistogram(anime$aired_from_yearSQ)
```

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 3750000 to 4050000 in 7 units and the y-axis ranges from 0 to 1500 in 4 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew11.png)

That looks a bit better, but not amazing.  Nowhere near normally distributed.  

---

### Cubing the Variable 

Now try cubing it: 

```{r}
anime$aired_from_yearCUBE <- anime$aired_from_year ^ 3
```

And take another look:

```{r}
plotNormalHistogram(anime$aired_from_yearCUBE)
```

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 7.4 e plus 09 to 8.2 e plus 09 in 5 units and the y-axis ranges from 0 to 1500 in 4 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew12.png)

Overall, that definitely looks like the best option, so you can settle for that.  Even though it is not perfect, you are out of options.

---

## Tukey's Ladder of Power Transformation

Wouldn't it be wonderful if there were some function that automatically transformed your data correctly, without all this time-consuming guesswork? Well, there is great news for you! There is.  It's called the ```transformTukey``` function, which performs Tukey's Ladder of Power Transformation, and is available from the ```rcompanion``` package as well. 

You will try out this new wizardry using the **[cruise ship data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/cruise_ship.zip)**.  Here is what the original data looks like: 

```{r}
library(rcompanion)
plotNormalHistogram(cruise_ship$Tonnage)
```

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 0 to 10000 in 6 units and the y-axis ranges from 0 to 40 in 5 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew13.png)

And then you can transform it with Tukey's, where the function ```transformTukey()``` takes the argument ```data$variable``` and where the argument ```plotit=``` takes either ```TRUE``` to add a plot or ```FALSE``` to do without.

```{r}
cruise_ship$TonnageTUK <- transformTukey(cruise_ship$Tonnage, plotit=FALSE)
```

Then you can plot again to see if it made a difference: 

```{r}
plotNormalHistogram(cruise_ship$TonnageTUK)
```

![A graph depicts the plot of x values against the frequency. The x-axis ranges from 0 to 10000 in 6 units and the y-axis ranges from 0 to 40 in 5 units. A curve is plotted on the bars. The curve and the bars have similar peak points.](Media/skew14.png)

And it did! Your data is now approximately normal, and it didn't require you to do a lot of trial and error guesswork.
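To take some of the mystery out of the automation, here is a rough Python sketch of the idea behind a ladder-of-powers search: try many powers and keep the one whose result looks most normal. It assumes the search scores each candidate with the Shapiro-Wilk W statistic, which is how the ```rcompanion``` documentation describes ```transformTukey```; the function names, the grid of powers, and the simulated data are all made up for illustration and are not the package's actual code.

```python
import numpy as np
from scipy import stats

def tukey_ladder(values, power):
    """Ladder rung: x**p for p > 0, log(x) for p == 0, -(x**p) for p < 0."""
    values = np.asarray(values, dtype=float)
    if power > 0:
        return values ** power
    if power == 0:
        return np.log(values)
    return -(values ** power)  # the minus sign keeps the original ordering

def best_power(values, powers=None):
    """Return the power whose transformed data has the largest Shapiro-Wilk W."""
    if powers is None:
        powers = np.round(np.linspace(-2, 2, 81), 2)  # hypothetical grid, step 0.05
    scores = [stats.shapiro(tukey_ladder(values, p))[0] for p in powers]
    return float(powers[int(np.argmax(scores))])

rng = np.random.default_rng(seed=5)
tonnage_like = rng.lognormal(mean=3.0, sigma=1.0, size=300)  # stand-in for a skewed column
chosen = best_power(tonnage_like)
print(f"chosen power: {chosen}")
```

Because the simulated data is log-normal, the chosen power should land near 0, the log rung of the ladder. The data must be positive for the log and negative-power rungs to work.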

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Transformations in R Activity<a class="anchor" id="DS105L2_page_5"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Activity, you will transform data in R. **This Activity will not be graded**, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[cruise ship data from last lesson](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/cruise_ship.zip)**, determine whether each continuous variable is positively skewed, negatively skewed, or normally distributed.  Then perform the correct transformations to get as close to the normal distribution as possible for each variable. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Transformations in R Activity Solution<a class="anchor" id="DS105L2_page_6"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

---

## Answers

The following variables are positively skewed and can be transformed by taking the square root: 

* Tonnage
* Cabins
* passngrs
* Crew
* PassSpcR
* outcab

The following variables are negatively skewed: 

* YearBlt
* Length

```YearBlt``` should be transformed by cubing, ```Length``` by squaring.

---

## Code

```{r}
# Load the appropriate library

library(rcompanion)

# Look to see what all the distributions are

plotNormalHistogram(cruise_ship$YearBlt)
plotNormalHistogram(cruise_ship$Tonnage)
plotNormalHistogram(cruise_ship$passngrs)
plotNormalHistogram(cruise_ship$Length)
plotNormalHistogram(cruise_ship$Cabins)
plotNormalHistogram(cruise_ship$Crew)
plotNormalHistogram(cruise_ship$PassSpcR)
plotNormalHistogram(cruise_ship$outcab)

# Transform Positively Skewed Variables

cruise_ship$TonnageSQRT <- sqrt(cruise_ship$Tonnage)
cruise_ship$passngrsSQRT <- sqrt(cruise_ship$passngrs)
cruise_ship$CabinsSQRT <- sqrt(cruise_ship$Cabins)
cruise_ship$CrewSQRT <- sqrt(cruise_ship$Crew)
cruise_ship$PassSpcRSQRT <- sqrt(cruise_ship$PassSpcR)
cruise_ship$outcabSQRT <- sqrt(cruise_ship$outcab)

# See if that fixes the issues

plotNormalHistogram(cruise_ship$TonnageSQRT)
plotNormalHistogram(cruise_ship$passngrsSQRT)
plotNormalHistogram(cruise_ship$CabinsSQRT)
plotNormalHistogram(cruise_ship$CrewSQRT)
plotNormalHistogram(cruise_ship$PassSpcRSQRT)
plotNormalHistogram(cruise_ship$outcabSQRT)

# Transform Negatively Skewed Variables

cruise_ship$YearBltSQ <- cruise_ship$YearBlt * cruise_ship$YearBlt
cruise_ship$LengthSQ <- cruise_ship$Length * cruise_ship$Length

# See if that made them normal

plotNormalHistogram(cruise_ship$YearBltSQ)
plotNormalHistogram(cruise_ship$LengthSQ)

# Length looks ok, but YearBlt could use additional transformation

cruise_ship$YearBltCUBE <- cruise_ship$YearBlt ^ 3

plotNormalHistogram(cruise_ship$YearBltCUBE)
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Transformations in Python<a class="anchor" id="DS105L2_page_7"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Basic Statistics in Python
VimeoVideo('335021517', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L02pg7tutorial.zip)**.

# Transformations in Python

Data transformations can also be done in Python.  You'll use the same **[anime data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/anime.zip)** that you used to transform data in R.

---

## Visualizing Transformations

```pandas``` has a function called ```.hist()``` that works pretty well for visually checking normality, because it just ignores missing values.  That means that you don't need to do a whole lot of data cleaning.  

Here's how ```.hist()``` works: 

```python
import pandas as pd
anime.aired_from_year.hist()
```

Just place the data frame name first, then the variable name, and finally call the function ```.hist()```. The only downside to ```.hist()``` is that it does not provide a fitted curve.  

![A bar chart represents the year from 1940 to 2020 in 9 units on the x-axis and the numbers from 0 to 3000 on the y-axis. The bars are plotted in an increasing pattern.](Media/skew15.png)

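If you want a fitted curve, you'll need to pull from ```seaborn```, using the ```histplot()``` function with ```kde=True```.  (Older versions of ```seaborn``` used ```distplot()``` for this, which has since been deprecated.)

### [Errata](Errata/DS105-Change-Log.ipynb)

```python
import seaborn as sns
sns.histplot(anime['aired_from_year'], kde=True)
```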
<a class="anchor" id="histplot"></a>
Here is the result: 

![A graph represents the plot with the representation of aired from year on the x-axis. The x-axis ranges from 1940 to 2020. The y-axis ranges from 0.00 to 0.10 in six units. A curve along with the bars is plotted on the graph.](Media/skew16.png)
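A smoothed curve drawn by ```seaborn``` follows the data itself rather than a normal distribution. If you would like something closer to R's ```plotNormalHistogram```, the sketch below is one possible approach: it computes a normal curve from the sample mean and standard deviation and draws it over a density histogram. Both function names are made up for this illustration, and the plotting helper assumes ```matplotlib``` is installed.

```python
import numpy as np
from scipy import stats

def normal_curve(values, points=200):
    """Return an x grid and the normal density with the sample's mean and std."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]      # ignore NaN and infinite entries
    grid = np.linspace(values.min(), values.max(), points)
    density = stats.norm.pdf(grid, loc=values.mean(), scale=values.std(ddof=1))
    return grid, density

def plot_normal_histogram(values, bins=30):
    """Histogram on a density scale with the normal curve drawn on top."""
    import matplotlib.pyplot as plt           # imported here so normal_curve works without it
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]
    grid, density = normal_curve(values)
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, density=True)  # density=True puts bars and curve on one scale
    ax.plot(grid, density)
    return ax
```

You could then call ```plot_normal_histogram(anime['aired_from_year'])``` and judge by eye how closely the bars follow the curve.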

---

## Transforming Positively Skewed Data

Take a look at the ```scored_by``` variable: 

```python
anime.scored_by.hist()
```

![The x-axis on a graph ranges from 0 to 1000000 and the y-axis ranges from 0 to 6000. The tallest bar crosses the point 6000 on the y-axis.](Media/skew17.png)

It is quite positively skewed! 

---

### Using np.sqrt()

So, try a square root transformation first, using the ```numpy``` function ```np.sqrt()```: 

```python
import numpy as np
anime['scored_bySQRT'] = np.sqrt(anime['scored_by'])
```

![The x-axis on a graph ranges from 0 to 1000 and the y-axis ranges from 0 to 4000. The tallest bar crosses the point 4000 on the y-axis.](Media/skew18.png)

Looking at the above graph, that is better, but still nowhere close to normal.  

---

### Using np.log()

So, try a log transformation! 

```python
anime['scored_byLOG'] = np.log(anime['scored_by'])
```

Again, this ```np.log()``` function comes from ```numpy```.  When you run it, you may get a warning like this one: 

```
RuntimeWarning: divide by zero encountered in log
```

This tells you that ```scored_by``` contains zeros: the log of zero is negative infinity, so the new column now holds infinite values! 
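You can confirm where the infinite values come from with a few made-up numbers. The sketch below is only an illustration; the ```votes``` series stands in for a column that contains a zero.

```python
import numpy as np
import pandas as pd

votes = pd.Series([0, 1, 10, 100], name="scored_by")  # made-up values, including a zero

with np.errstate(divide="ignore"):   # silence the RuntimeWarning for this demonstration
    logged = np.log(votes)

n_infinite = int(np.isinf(logged).sum())  # how many values became infinite
print(logged.tolist(), n_infinite)
```

Counting the infinite values this way tells you how many rows you are about to lose before you remove them.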

---

### Dealing with Infinite Data

If you try and run the histogram of your log transformed data when you've been given a warning about infinite values, you will get this error: 

```text
ValueError: range parameter must be finite.
```

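But don't panic! There is a fix, and a relatively easy one at that.  With default ```pandas``` settings, dropping ```na``` values does not remove infinite values by itself, so first convert the infinite values to missing values and then call the ```dropna()``` function on your dataset.

```python
# dropna() alone keeps infinite values, so turn them into NaN first
anime = anime.replace([np.inf, -np.inf], np.nan).dropna()
```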
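It is worth checking what ```dropna()``` actually removes. With default ```pandas``` settings it treats an infinite value as an ordinary number, as this small sketch with a made-up data frame shows; converting infinite values to ```NaN``` first removes both kinds of problem rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"scored_byLOG": [-np.inf, 0.0, 2.3, np.nan]})  # made-up values

only_dropna = frame.dropna()                                 # removes NaN, keeps -inf
cleaned = frame.replace([np.inf, -np.inf], np.nan).dropna()  # removes NaN and -inf

print(len(only_dropna), len(cleaned))
```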

And then you are ready to rumble with your histogram once more: 

```python
anime.scored_byLOG.hist()
```

![A bar chart with eight units on the x-axis and seven units on the y-axis. The x-axis ranges from seven to fourteen and the y-axis ranges from 0 to 70.](Media/skew19.png)

That doesn't look too shabby! Not quite normal, but you'll take it. 

---

## Transforming Negatively Skewed Data

In order to transform negatively skewed data, you will either square or cube your data. 

---

### Squaring the Variable

How about trying to transform the ```aired_from_year``` variable that you looked at earlier? It had a relatively large negative skew to it. So, start by squaring your data. The ```**``` operator raises a value to a power, so ```**2``` squares the variable. 

```python
anime['aired_from_yearSQ'] = anime['aired_from_year']**2
```

Then take a look at the histogram to assess your progress: 

```python
anime.aired_from_yearSQ.hist()
```

![A bar chart with six units on the x-axis and the y-axis. The x-axis ranges from 3800000 to 4050000 and the y-axis ranges from 0 to 3000. The bars are plotted in an increasing pattern.](Media/skew20.png)

---

### Cubing the Variable

The histogram above still does not look very normal, which means that it is time to try cubing it!

```python
anime['aired_from_yearCUBE'] = anime['aired_from_year']**3
```

Then check the histogram once more: 

```python
anime.aired_from_yearCUBE.hist()
```

![A bar chart with five units on the x-axis and the y-axis. The x-axis ranges from 0 to 8.2 and the y-axis ranges from 0 to 3000. The bars are plotted in an increasing pattern.](Media/skew21.png)

This has not made a lot of impact, but it is slightly better than the original, so you most likely want to use the cubed transformation.

---

## Box-Cox Transformation

Just like the Tukey's Ladder of Power Transformations in R, you can transform by power in Python as well.  The Python version, the ```boxcox()``` function from ```scipy.stats```, applies whatever power you give it. It is defined only for positive values, and it does not seem to work well for negatively skewed data (so check your results very carefully!). If you leave out the power, ```boxcox()``` estimates one for you and returns it along with the transformed data.

In order to use the ```boxcox()``` function, you will need to import a few packages:

```python
from scipy import stats
from scipy.stats import boxcox
```

Then you can call the function into a new variable, like you have been doing: 

```python
anime['scored_byLOG1'] = boxcox(anime['scored_by'],0)
```

Use the ```boxcox()``` function, and specify the variable.  Then the last argument is the power value that you learned about at the beginning of this lesson.  A power of 0 is a log transformation.
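Here is a short sketch of both ways of calling ```boxcox()```, using simulated positive data rather than the anime dataset. When no power is passed, ```scipy``` returns a pair: the transformed values and the power it estimated.

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(seed=6)
positive = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # made-up, strictly positive data

logged = boxcox(positive, 0)           # power 0: the same values as np.log
rooted = boxcox(positive, 0.5)         # power 1/2: (sqrt(x) - 1) / 0.5, a shifted and rescaled square root
auto, chosen_power = boxcox(positive)  # no power given: scipy estimates one for you

print(f"estimated power: {chosen_power:.2f}")
```

Because Box-Cox subtracts one and divides by the power, its output for powers other than 0 is a shifted and rescaled version of the plain power transformation; the shape of the distribution is the same.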

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Transformations in Python Activity<a class="anchor" id="DS105L2_page_8"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Activity, you will transform data in Python. **This Activity will not be graded**, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[cruise ship data from last lesson](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/cruise_ship.zip)**, determine whether each continuous variable is positively skewed, negatively skewed, or normally distributed.  Then perform the correct transformations to get as close to the normal distribution as possible for each variable. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Transformations in Python Activity Solution<a class="anchor" id="DS105L2_page_9"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

Here is a [Jupyter Notebook with the answers](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/Transforming_Data_in_Python_Activity_Solution.zip), so that you can check your work! 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to extract the zip file and save it to your computer, then open it in Jupyter Notebook in order to open the above file.</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Key Terms<a class="anchor" id="DS105L2_page_10"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

---

## Key Terms

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Skew</td>
        <td>A lack of symmetry in a distribution, where one tail is longer than the other.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Negative Skew</td>
        <td>A distribution that has the tail to the left and the bulk of data to the right.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Positive Skew</td>
        <td>A distribution that has the tail to the right and the bulk of the data to the left.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Kurtosis</td>
        <td>How peaked or flat a distribution is compared with the normal distribution.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Leptokurtic</td>
        <td>A positively kurtotic distribution that looks taller than the normal distribution.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Platykurtic</td>
        <td>A negatively kurtotic distribution that looks flatter than the normal distribution.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mesokurtic</td>
        <td>The normal distribution's kurtosis. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Power</td>
        <td>Defined as 1-slope, power is on a ladder system that determines how data can be transformed.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Reciprocal or Reflection</td>
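        <td>Ways of flipping your current data. Reflecting reverses the order of the values so the direction of skew switches; taking the reciprocal divides 1 by each value.</td>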
    </tr>
</table>
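
The skew and kurtosis terms above can also be checked numerically. The sketch below is illustrative only: it assumes `numpy` and `scipy` are installed, and every sample is synthetic data generated for the example rather than anything from the lesson's datasets. Reflection is shown here as simple negation, which is one of several ways to flip data.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(seed=0)

# Synthetic samples, made up purely for illustration
normal_data = rng.normal(loc=0, scale=1, size=5000)    # roughly mesokurtic
right_tailed = rng.exponential(scale=1.0, size=5000)   # positive skew, leptokurtic
left_tailed = -right_tailed                            # reflected copy: negative skew
flat_data = rng.uniform(low=0, high=1, size=5000)      # platykurtic

samples = {
    "normal": normal_data,
    "right-tailed": right_tailed,
    "left-tailed": left_tailed,
    "flat": flat_data,
}

for name, values in samples.items():
    # kurtosis() reports *excess* kurtosis by default, so a normal distribution is near 0
    print(f"{name:>12}: skew={skew(values):.2f}, excess kurtosis={kurtosis(values):.2f}")
```

A positive skew value corresponds to a tail on the right, a negative value to a tail on the left. Positive excess kurtosis indicates a leptokurtic shape, and negative excess kurtosis a platykurtic one.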

---

## Key R Libraries 

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>rcompanion</td>
        <td>A library that provides plotNormalHistogram() and transformTukey(), used to assess normality and transform data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>IDPmisc</td>
        <td>A library that provides NaRV.omit(), used to screen out missing and infinite values.</td>
    </tr>
</table>


---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plotNormalHistogram()</td>
        <td>Easily creates a histogram with a best-fit curve to show how it matches the normal distribution.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sqrt()</td>
        <td>Takes the square root of a variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>log()</td>
        <td>Takes the natural log of a variable by default.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>NaRV.omit()</td>
        <td>Removes missing and infinite values from a dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>transformTukey()</td>
        <td>Automatically chooses a power transformation that makes your data approximate the normal distribution as closely as possible.</td>
    </tr>
</table>


---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scipy.stats</td>
        <td>The SciPy module that houses the boxcox() function for the Box-Cox transformation.</td>
    </tr>
</table>


---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>np.sqrt()</td>
        <td>Takes the square root of a variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>np.log()</td>
        <td>Takes the natural log of a variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.dropna(inplace=True)</td>
        <td>Drops missing values.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>boxcox()</td>
        <td>Performs a Box-Cox transformation on positive data using the power (lambda) you choose; if you do not supply one, it estimates the best-fitting lambda for you.</td>
    </tr>
</table>
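
As a quick refresher on how the functions in the table above fit together, here is a minimal sketch. It assumes `numpy` and `scipy` are installed and uses synthetic, strictly positive data invented for the example; the variable names are hypothetical and not taken from the lesson.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(seed=1)

# Synthetic, strictly positive, positively skewed data (illustration only)
values = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Square root: a milder transformation for positive skew
sqrt_values = np.sqrt(values)

# Natural log: a stronger transformation; every value must be greater than 0
log_values = np.log(values)

# With no lmbda argument, boxcox() estimates lambda and returns
# both the transformed data and the fitted lambda
bc_values, fitted_lambda = boxcox(values)

# With lmbda supplied, only the transformed array is returned
bc_fixed = boxcox(values, lmbda=0.5)

print(f"raw skew:     {skew(values):.2f}")
print(f"sqrt skew:    {skew(sqrt_values):.2f}")
print(f"log skew:     {skew(log_values):.2f}")
print(f"Box-Cox skew: {skew(bc_values):.2f} (fitted lambda = {fitted_lambda:.2f})")
```

Comparing the skew before and after each transformation is one way to decide which transformation brings the data closest to normal; plotting a histogram of each result is another.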


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Lesson 2 Hands-On<a class="anchor" id="DS105L2_page_11"></a>

[Back to Top](#DS105L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For this Hands-On, you will be transforming data in both Python and R.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This Hands-On uses a dataset about the number of trips run by the Seattle Parks and Recreation department.  It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/Seattle_ParksnRec.zip)**. For each part, assess and transform the requested data, then submit your annotated program files for review.

---

### Part I: Transforming Data in Python

In Python, assess the skew of each of the following variables and transform it if necessary:

* ```# of trips Winter```
* ```# of participants Winter```
* ```# of trips Spring```
* ```# of participants Spring```
* ```# of trips Summer```
* ```# of participants Summer```

Please make notes about each variable's distribution and the transformation you made in your Python file and submit.
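
Because these column names contain spaces and a `#` symbol, they have to be accessed with bracket notation rather than dot notation. The sketch below shows one possible way to check a column's skew. It assumes `numpy` and `pandas` are installed; the `demo` DataFrame holds made-up stand-in values, not the Seattle data, and both the `describe_skew` helper and its 0.5 cutoff are hypothetical choices for illustration, not a rule from this lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# Stand-in data: made-up values, NOT the Seattle Parks and Recreation dataset
demo = pd.DataFrame({
    "# of trips Winter": rng.poisson(lam=3, size=200).astype(float),
    "# of participants Winter": rng.lognormal(mean=3, sigma=0.8, size=200),
})
demo.loc[5, "# of trips Winter"] = np.nan  # simulate a missing value


def describe_skew(frame, column, cutoff=0.5):
    """Return (skew, label) for one column, ignoring missing values.

    The cutoff is an arbitrary rule of thumb chosen for this example.
    """
    values = frame[column].dropna()  # bracket notation handles the '#' and spaces
    s = values.skew()
    if s > cutoff:
        label = "positive"
    elif s < -cutoff:
        label = "negative"
    else:
        label = "roughly symmetric"
    return s, label


for col in demo.columns:
    s, label = describe_skew(demo, col)
    print(f"{col}: skew={s:.2f} ({label})")
```

A numeric summary like this is best paired with a histogram of each variable, since the shape of the plot is what your notes should describe.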

---

### Part II: Transforming Data in R

In R, assess the skew of each of the following variables and transform it if necessary:

* ```# of trips Fall```
* ```# of participants Fall```
* ```# of trips per Year```
* ```# participants per Year```
* ```increase/decrease of prior year```
* ```Average # people per trip```

Please make notes about each variable's distribution and the transformation you made in your R script and submit.


<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>