# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
%matplotlib inline
# import numpy and pandas

import numpy as np
import pandas as pd

# Challenge 1 - Analysis of Variance

In this part of the lesson, we will perform an analysis of variance to determine whether the factors in our model create a significant difference in the group means. We will be examining a dataset of FIFA players. We'll start by loading the data using the code in the cell below.

In [2]:
# Run this code:

fifa = pd.read_csv('fifa.csv')

Let's examine the dataset by looking at the `head`.

In [65]:
# Your code here:

fifa.Value.value_counts()

€1.1M      431
€375K      372
€425K      354
€325K      351
€450K      343
          ... 
€36M         1
€43M         1
€46M         1
€59M         1
€110.5M      1
Name: Value, Length: 217, dtype: int64

In [5]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            18207 non-null  object 
 1   Age             18207 non-null  int64  
 2   Nationality     18207 non-null  object 
 3   Overall         18207 non-null  int64  
 4   Potential       18207 non-null  int64  
 5   Club            17966 non-null  object 
 6   Value           18207 non-null  object 
 7   Preferred Foot  18159 non-null  object 
 8   Position        18147 non-null  object 
 9   Weak Foot       18159 non-null  float64
 10  Acceleration    18159 non-null  float64
 11  SprintSpeed     18159 non-null  float64
 12  Stamina         18159 non-null  float64
 13  Strength        18159 non-null  float64
 14  LongShots       18159 non-null  float64
 15  Aggression      18159 non-null  float64
 16  Interceptions   18159 non-null  float64
 17  Positioning     18159 non-null 

Players' values are expressed in euros, with an `M` (millions) or `K` (thousands) suffix. We would like this column to be numeric, so let's create a numeric value column. Do this by stripping all non-numeric characters from each cell and assigning the result to `ValueNumeric`. The result does not have to be expressed in millions or in thousands specifically, but all values must be converted carefully to the same scale.

In [17]:
# Your code here:
fifa.Value[0].replace('€', '').replace('M', '')

'110.5'

In [None]:
df.Marca = df.Marca.str.replace(r'\s+\d', '', regex = True).str.strip(' ')

In [66]:
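# Strip the currency symbol and the M suffix; K is rewritten as a '/1000' expression.
# Bracket assignment is used because attribute assignment does not reliably create a new column.
fifa['ValueNumeric'] = fifa.Value.str.replace('€', '').str.replace('M', '').str.replace('K', '/1000')

In [67]:
# pd.to_numeric only parses plain numbers; it does not evaluate expressions such as '600/1000'
fifa['ValueNumeric'] = pd.to_numeric(fifa['ValueNumeric'])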

ValueError: Unable to parse string "600/1000" at position 926
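
The error above arises because `pd.to_numeric` parses numbers but does not evaluate arithmetic such as `600/1000`. One possible way around this, shown here as a minimal sketch rather than the lab's intended solution, is to read the suffix and multiply explicitly so that every value ends up in euros. The helper name `parse_euro_value` and the sample strings are hypothetical; the sketch assumes every value is a number with an optional `M` or `K` suffix.

```python
import pandas as pd

def parse_euro_value(text):
    """Convert strings such as '€110.5M' or '€375K' to a float in euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    cleaned = text.replace("€", "").strip()
    if cleaned and cleaned[-1] in multipliers:
        # Number followed by a scale suffix
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    # Assumption: anything without a suffix is already a plain number
    return float(cleaned)

# Hypothetical sample in the same format as the Value column
sample = pd.Series(["€110.5M", "€375K", "€1.1M", "€0"])
value_numeric = sample.map(parse_euro_value)
print(value_numeric)
```

On the real data this would be applied as `fifa['ValueNumeric'] = fifa['Value'].map(parse_euro_value)`. Note that the ANOVA tables later in this notebook were produced before this cell was rerun, so their sums of squares would change under a different scaling.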

In [39]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            18207 non-null  object 
 1   Age             18207 non-null  int64  
 2   Nationality     18207 non-null  object 
 3   Overall         18207 non-null  int64  
 4   Potential       18207 non-null  int64  
 5   Club            17966 non-null  object 
 6   Value           18207 non-null  object 
 7   Preferred Foot  18159 non-null  object 
 8   Position        18147 non-null  object 
 9   Weak Foot       18159 non-null  float64
 10  Acceleration    18159 non-null  float64
 11  SprintSpeed     18159 non-null  float64
 12  Stamina         18159 non-null  float64
 13  Strength        18159 non-null  float64
 14  LongShots       18159 non-null  float64
 15  Aggression      18159 non-null  float64
 16  Interceptions   18159 non-null  float64
 17  Positioning     18159 non-null 

#### We'd like to determine whether a player's preferred foot and position have an impact on their value. 

Using the `statsmodels` library, we can produce an ANOVA table without reshaping our data. Create an ANOVA table with value as a function of position and preferred foot. Recall that the `C` function marks a variable as categorical in the model formula.

Hint: For columns that have a space in their name, it is best to refer to the column using the dataframe (For example: for column `A`, we will use `df['A']`).

In [41]:
# Your code here:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import f_oneway


In [44]:
modelo_simple = ols('ValueNumeric ~ Position', data = fifa).fit()

In [45]:
sm.stats.anova_lm(modelo_simple)

               df        sum_sq        mean_sq         F        PR(>F)
Position     26.0     9006867.0  346417.962922  4.141509  7.234254e-12
Residual  18120.0  1515654000.0   83645.344349       NaN           NaN


In [48]:
fifa.columns = fifa.columns.str.replace(' ', '_')

In [54]:
fifa.columns

Index(['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value',
       'Preferred_Foot', 'Position', 'Weak_Foot', 'Acceleration',
       'SprintSpeed', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'ValueNumeric'],
      dtype='object')

In [55]:
modelo_simple_2 = ols('ValueNumeric ~ Preferred_Foot', data = fifa).fit()

In [56]:
sm.stats.anova_lm(modelo_simple_2)

                     df        sum_sq        mean_sq         F    PR(>F)
Preferred_Foot      1.0      316189.3  316189.318982  3.764163  0.052378
Residual        18157.0  1525186000.0   83999.903256       NaN       NaN


In [57]:
modelo_simple_3 = ols('ValueNumeric ~ Position + Preferred_Foot', data = fifa).fit()
sm.stats.anova_lm(modelo_simple_3)

                     df        sum_sq        mean_sq         F        PR(>F)
Position           26.0     9006867.0  346417.962922   4.14148  7.236436e-12
Preferred_Foot      1.0      72922.11   72922.110443  0.871795     0.3504713
Residual        18119.0  1515581000.0   83645.936172       NaN           NaN


What is your conclusion from this ANOVA?

In [6]:
# Your conclusions here:



After looking at a model of both preferred foot and position, we decide to create an ANOVA table for nationality. Create an ANOVA table for numeric value as a function of nationality.
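
For a single factor, the same F-test can also be computed with `scipy.stats.f_oneway`, which is imported earlier in this notebook but not yet used. The sketch below runs it on made-up data; the `toy` frame, its three nationality labels and the group means are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical data: three nationalities, the third with a clearly higher mean
toy = pd.DataFrame({
    "Nationality": np.repeat(["A", "B", "C"], 30),
    "ValueNumeric": np.concatenate([
        rng.normal(100, 10, 30),
        rng.normal(100, 10, 30),
        rng.normal(130, 10, 30),
    ]),
})

# One array of values per group, then a one-way ANOVA across the groups
groups = [g["ValueNumeric"].to_numpy() for _, g in toy.groupby("Nationality")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one.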

In [8]:
# Your code here:



What is your conclusion from this ANOVA?

# Challenge 2 - Linear Regression

Our goal with using linear regression is to create a mathematical model that will enable us to predict the outcome of one variable using one or more additional independent variables.

We'll start by ensuring there are no missing values. Examine every variable for missing values. If a row contains any missing value, remove the entire row.
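
A minimal sketch of this check on a made-up frame (the `toy` data is hypothetical): `isna().sum()` counts the missing entries per column, and `dropna()` removes every row that contains at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each of two different rows
toy = pd.DataFrame({
    "Club": ["X", None, "Z"],
    "Stamina": [70.0, 65.0, np.nan],
    "Potential": [80, 75, 90],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop any row with at least one missing value and renumber the index
clean = toy.dropna().reset_index(drop=True)
print(clean)
```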

In [9]:
# Your code here:



Using the FIFA dataset, in the cell below, create a linear model predicting value using stamina and sprint speed. Create the model using `statsmodels` and print the model summary.

Hint: remember to add an intercept to the model using the `add_constant` function.

In [10]:
# Your code here:



Report your findings from the model summary. In particular, assess the model as a whole using the F-test and report how much of the variation is explained by the model using R-squared.

In [11]:
# Your conclusions here:



Next, create a second regression model predicting value using potential. Create the model using `statsmodels` and print the model summary. Remember to add a constant term.

In [12]:
# Your code here:



Report your findings from the model summary. In particular, assess the model as a whole using the F-test and report how much of the variation is explained by the model using R-squared.

In [13]:
# Your conclusions here:



Plot a scatter plot of value vs. potential. Do you see a linear relationship?
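
One way to draw such a plot with `matplotlib` is sketched below on simulated data; the arrays are hypothetical and deliberately follow a curved relationship, to show the kind of pattern a straight line would fit poorly. In the notebook, the `%matplotlib inline` magic from the first cell displays the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)

# Hypothetical data: value grows faster than linearly with potential
potential = rng.uniform(50, 95, 300)
value = 0.001 * (potential - 45) ** 3 + rng.normal(0, 5, 300)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(potential, value, s=10, alpha=0.5)
ax.set_xlabel("Potential")
ax.set_ylabel("ValueNumeric")
ax.set_title("Value vs. potential (simulated)")
```

If the points bend away from a straight line or fan out as potential increases, a plain linear model is likely to be a poor description of the relationship.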

In [14]:
# Your code here:

