# Comparing The Distribution of CO2 Emissions By Country in 2000 and in 2017

By: Carnell Zhou cz375
    Ellen Li el667

Introduction: 
FIFA 19, the 26th installment of the FIFA series, is a competitive football simulation video game. The game features a large selection of real-life players, each with factual details such as club, age, preferred foot, wage, and nationality, as well as assigned numerical scores that estimate their ability in a host of categories including dribbling, ball control, composure, and accuracy. Inside the game, the user can assume control of the different in-game players by passing the ball between them. In addition, the user can control an individual player's motion, shot timing (to produce more accurate shots), and dribbling moves to get around defenders.

Description:
The data set that we chose represents FIFA 19's player attributes. It details 18206 unique players, each with the attributes: Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

# Cleaning the Dataset

First, let's clean the dataset. Let's begin by importing pandas and NumPy.

In [None]:
import pandas as pd
import numpy as np

The Problem:
Will a player's overall rating surpass a score of 70?
Will a certain selection of attributes be a better predictor of whether their overall surpasses 70?

Hypothesis:
1. A set of attributes representing athleticism (i.e., height and weight) will be a better predictor than a set of attributes representing skill (i.e., dribbling and accuracy) in determining whether overall > 70.
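One way the hypothesis could be tested is to fit the same classifier on each attribute set and compare held-out accuracy. The sketch below is only an illustration under stated assumptions: the helper `holdout_accuracy` is hypothetical, logistic regression is our choice of classifier rather than something the notebook prescribes, and the data is synthetic. The synthetic `overall` is built from the skill columns by construction, so the printed numbers say nothing about the real FIFA 19 data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_accuracy(features, overall, threshold=70):
    """Fit a classifier for `overall > threshold` and return its test accuracy."""
    labels = (np.asarray(overall) > threshold).astype(int)
    x_train, x_test, y_train, y_test = train_test_split(
        np.asarray(features), labels,
        test_size=0.2, random_state=42, stratify=labels,
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(x_train, y_train)
    return clf.score(x_test, y_test)


# Synthetic stand-in data (NOT the FIFA 19 dataset).
rng = np.random.default_rng(0)
n = 500
skill = rng.normal(55, 15, size=(n, 2))  # stand-ins for Dribbling, FKAccuracy
athletic = np.column_stack([
    rng.normal(71, 3, n),     # stand-in for height in inches
    rng.normal(166, 15, n),   # stand-in for weight in pounds
])
# Assumption for illustration only: overall depends on skill plus noise.
overall = 40 + 0.5 * skill.mean(axis=1) + rng.normal(0, 4, n)

skill_acc = holdout_accuracy(skill, overall)
athletic_acc = holdout_accuracy(athletic, overall)
print("skill attributes:   ", skill_acc)
print("athletic attributes:", athletic_acc)
```

With the real data, the same function could be called once per attribute set after the columns have been made numeric, and the two accuracies compared against the share of players above 70 as a baseline.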



In [None]:
fifa = pd.read_csv('../input/fifa19/data.csv')
# Split players at an Overall rating of 70 (players at exactly 70 fall in the upper group)
score_below_70 = fifa.loc[fifa['Overall'] < 70]
score_above_70 = fifa.loc[fifa['Overall'] >= 70]
score_below_70.head()

In [None]:
score_above_70.head()

In [None]:
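# Weight is stored as text such as '166lbs'; strip the unit before averaging
fifa['Weight'].str.replace('lbs', '').astype(float).mean()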

In [None]:
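# Fill missing values with typical values; assigning back avoids chained inplace fills
fifa['Weight'] = fifa['Weight'].fillna('166lbs')
fifa['Height'] = fifa['Height'].fillna("5'11")
fifa['Overall'] = fifa['Overall'].fillna(66)  # numeric, to match the column's dtype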
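Because Height and Weight are filled with text values such as `"5'11"` and `'166lbs'`, they need to become numbers before they can be averaged, correlated, or used as model inputs. The following is a minimal sketch of that conversion; the helper names `height_to_inches` and `weight_to_lbs` are hypothetical, and the sketch assumes every value follows the two formats shown in the cell above.

```python
import re


def height_to_inches(height):
    """Convert a feet'inches string such as "5'11" to total inches."""
    match = re.fullmatch(r"\s*(\d+)'(\d+)\s*", height)
    if match is None:
        raise ValueError(f"unrecognized height: {height!r}")
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches


def weight_to_lbs(weight):
    """Convert a string such as "166lbs" to a float number of pounds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*lbs\s*", weight)
    if match is None:
        raise ValueError(f"unrecognized weight: {weight!r}")
    return float(match.group(1))


print(height_to_inches("5'11"))  # 71
print(weight_to_lbs("166lbs"))   # 166.0
```

Once missing values have been filled, the helpers could be applied column-wise, for example with `fifa['Height'].map(height_to_inches)`.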

Now let's try to visualize these two datasets with a correlation matrix.

In [None]:
fifahw = pd.DataFrame(fifa, columns = ['Height','Weight','Overall'])
fifahw.sample(10)

In [None]:
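import seaborn as sn
import matplotlib.pyplot as plt

# Height ("5'11") and Weight ("166lbs") are stored as text; convert them to
# inches and pounds so that .corr() can include them.
feet_inches = fifa['Height'].str.extract(r"(\d+)'(\d+)").astype(float)
athletic = pd.DataFrame({
    'Height': feet_inches[0] * 12 + feet_inches[1],
    'Weight': fifa['Weight'].str.replace('lbs', '').astype(float),
    'Overall': pd.to_numeric(fifa['Overall']),
})

sn.heatmap(athletic.corr(), annot=True)
plt.title('Correlation Matrix: Height, Weight, Overall')
plt.show()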
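Each cell of a correlation matrix is a pairwise coefficient, and pandas' `.corr()` uses the Pearson coefficient by default. As a rough illustration of what one cell holds, here is a plain-Python sketch; the function `pearson` is hypothetical and the sample values are made up, not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up heights (inches) and weights (pounds) for illustration only.
heights = [68, 70, 71, 73, 75]
weights = [150, 160, 165, 175, 190]
r = pearson(heights, weights)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.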

In [None]:
fifa['Dribbling'].mean()

In [None]:
# Fill with a number (not the string '55') so the column stays numeric
fifa['Dribbling'] = fifa['Dribbling'].fillna(55)

In [None]:
fifa['FKAccuracy'].mean()

In [None]:
# Fill with a number (not the string '43') so the column stays numeric
fifa['FKAccuracy'] = fifa['FKAccuracy'].fillna(43)

In [None]:
sn.heatmap(fifa[['Dribbling','FKAccuracy','Overall']].corr(), annot = True)

plt.title('Correlation Matrix')
plt.show()

In [None]:
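from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Overall is the target, so it is not a feature; Height and Weight are converted from text to numbers.
feet_inches = fifa['Height'].str.extract(r"(\d+)'(\d+)").astype(float)
athletic_features = pd.DataFrame({'Height': feet_inches[0] * 12 + feet_inches[1],
                                  'Weight': fifa['Weight'].str.replace('lbs', '').astype(float)})
target = pd.to_numeric(fifa['Overall'])
x_train, x_test, y_train, y_test = train_test_split(athletic_features, target, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # R^2 on the held-out 20%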



In [None]:
import seaborn as sns

In [None]:
emissions_china = emissions.loc[(emissions['Entity']=='China')]
emissions_china_year = emissions_china[['Year','Annual CO₂ emissions (tonnes )']]
emissions_china_year.head()

In [None]:
emissions_us = emissions.loc[(emissions['Entity']=='United States')]
emissions_us_year = emissions_us[['Year','Annual CO₂ emissions (tonnes )']]
emissions_us_year.head()

In [None]:
import matplotlib.pyplot as plt
plt.plot(emissions_china_year['Year'],emissions_china_year['Annual CO₂ emissions (tonnes )'], label='China')
plt.plot(emissions_us_year['Year'],emissions_us_year['Annual CO₂ emissions (tonnes )'], label='US')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Annual CO₂ emissions (tonnes)')
plt.title('China vs US CO₂ emissions')
plt.show()

Looking at this 3D scatterplot, we can see that versicolor and virginica are actually much more separable than our previous 2D plot indicated. We can therefore conclude that these three features are enough to implement an effective classifier.
### Heat Map
Another plot for visualizing density is a heat map. A heat map shows a bivariate distribution by assigning colors to different regions depending on the density (or frequency or magnitude, depending on which feature you are trying to visualize) of values in that region.
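To make the idea of "density per region" concrete, the sketch below counts points in a grid of equal-width cells, which is the kind of table a heat map colors. It is a simplified illustration, not NumPy's implementation: `histogram_2d` is a hypothetical helper and the sample values are made up. `np.histogram2d` returns a comparable count array indexed as `[x bin][y bin]` together with the bin edges, which is why the count array is transposed before being passed to `pcolormesh`.

```python
def histogram_2d(xs, ys, bins=10):
    """Count (x, y) points in a bins-by-bins grid of equal-width cells."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    counts = [[0] * bins for _ in range(bins)]

    def bin_index(value, low, high):
        if high == low:
            return 0
        # Scale the value to [0, bins); the maximum goes into the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    for x, y in zip(xs, ys):
        counts[bin_index(x, x_min, x_max)][bin_index(y, y_min, y_max)] += 1
    return counts


# Made-up yearly values for illustration only (not the emissions dataset).
years = [2000, 2001, 2002, 2003, 2004, 2005]
tonnes = [3.4e9, 3.5e9, 3.8e9, 4.5e9, 5.2e9, 5.8e9]
grid = histogram_2d(years, tonnes, bins=3)
for row in grid:
    print(row)
```

Each row of `grid` corresponds to one range of years and each column to one range of emission values; a cell's count is what the color encodes.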

In [None]:
Z, hmx, hmy = np.histogram2d(emissions_china_year['Year'],emissions_china_year['Annual CO₂ emissions (tonnes )'])

plt.title('Heatmap of China\'s Carbon Emissions')
plt.xlabel('Year')
plt.ylabel('Annual CO₂ emissions (tonnes )')
plt.pcolormesh(hmx,hmy,Z.T)

plt.show()
