# Assignment #2 - Data Gathering and Warehousing - DSSA-5102

Instructor: Melissa Laurino</br>
Spring 2024</br></br>
Name: Udoy Chowdhury</br>
Date: 2024-01-25

Our next objective is to choose <b>ONE</b> of the datasets from our previous assignment to explore further. The datasets we chose for Assignment #1 are manageable to clean in R (or Python if that is what you prefer to explore; see the technology check for working with Python in R in Jupyter notebook). Depending on your data, and especially its size, it may be more beneficial to clean in a language we are already comfortable working in instead of cleaning our data in SQL. SQL may be needed for cleaning databases that are very large, such as hundreds of terabytes in size. We will clean our datasets first before we attempt to load them into our SQL databases. </br>
Not only is data everywhere, but it can also be messy. Messy data can originate in the data collection process, whether through typos in manual data entry or through outdated collection forms that hold multiple variables that mean the same thing. For example, while collecting data on marine mammals, it is important to note who the observer is. When Python or R reads an Excel or CSV file, it treats "Melissa Laurino" and "melissa laurino" as two separate observers because string comparisons are case sensitive. This is not accurate, because both entries refer to the same person in the observer column.</br>
Clean data is important for consistency that leads to accurate results and analysis. If we are using our data to make informed decisions in our field, we need it to be clean. We do not want to omit rows that may make a difference to our dataset because they do not fit certain criteria due to typos, but how much should the original dataset be altered? Depending on your field, there may be regulations and compliance standards regarding data quality. Protocols may state that if the data does not read exactly how it should, then it should be omitted. </br>
For our learning objectives in this class, we will clean our data. Our first assignment in our warehousing journey was important because it allowed us to gain a better understanding of a dataset that we personally did not collect. Now that we have that understanding, we can explore it in greater depth and clean it as necessary.<br>
<br>
It is important when cleaning data to: <br>
*Make detailed comments with your code* <br>
*Record EVERYTHING omitted and changed if necessary* <br>
*Since we are exploring and learning without a specific organization policy, use your best judgement when omitting records. If you have chosen to omit data, please explain why.*</br>
<br>
<b>The code that I have written below is just to give you ideas on exploring and cleaning data. It is encouraged that you explore and clean it in greater detail than what I have written below for full credit.</b><br>
Additional examples: https://epirhandbook.com/en/cleaning-data-and-core-functions.html
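
The observer example above can be made concrete with a minimal sketch. The `observations` table and its `observer` column are hypothetical and only illustrate the idea; the right casing convention depends on your own data.

```python
import pandas as pd

# Hypothetical observer column with inconsistent casing and stray whitespace
observations = pd.DataFrame(
    {"observer": ["Melissa Laurino", "melissa laurino", " MELISSA LAURINO "]}
)

# Before cleaning, each spelling counts as a different observer
observers_before = observations["observer"].nunique()

# Trim whitespace and apply one casing convention to every entry
observations["observer_clean"] = observations["observer"].str.strip().str.title()
observers_after = observations["observer_clean"].nunique()

print(f"Distinct observers before: {observers_before}, after: {observers_after}")
```

Keeping the original column next to the cleaned one makes it easy to record exactly what was changed.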

<b>Dataset name: male_players.csv</b><br>
<b>Company/Government Organization: Kaggle</b><br>
Download link: https://www.kaggle.com/datasets/stefanoleone992/ea-sports-fc-24-complete-player-dataset?resource=download

Load necessary libraries:

In [43]:
# Load the necessary libraries
import pandas as pd # Reading data
import matplotlib.pyplot as plt # Graphing data
from IPython.display import display # Reading data in tabular form

# Remove warnings
import warnings 
warnings.filterwarnings('ignore')


# Make it so all columns are shown
pd.set_option('display.max_columns', None)

Load data into Python:

In [3]:
# Loading in data
file_path = '/Users/udoychowdhury/Documents/DataScience/Soccer Data/male_players.csv'
# Column 108 contains mixed data types, so read it as a string
maleplayersdf = pd.read_csv(file_path, dtype={108: str})

Exploration before cleaning:

In [13]:
# Display the structure of the dataset
maleplayersdf.info()

# Display a summary of the numerical dataset
maleplayersdf.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180021 entries, 0 to 180020
Columns: 109 entries, player_id to gk
dtypes: float64(20), int64(43), object(46)
memory usage: 149.7+ MB


Unnamed: 0,player_id,fifa_version,fifa_update,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_id,league_level,club_jersey_number,club_contract_valid_until_year,nationality_id,nation_team_id,nation_jersey_number,weak_foot,skill_moves,international_reputation,release_clause_eur,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed
count,180021.0,180021.0,180021.0,180021.0,180021.0,177868.0,178173.0,180021.0,180021.0,180021.0,178156.0,178156.0,177771.0,178156.0,178156.0,180021.0,10098.0,10098.0,180021.0,180021.0,180021.0,120722.0,159997.0,159997.0,159997.0,159997.0,159997.0,159997.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,147133.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,180021.0,20024.0
mean,217326.670294,19.62145,2.0,65.712711,70.779581,2379142.0,10638.01081,25.138689,181.287061,75.233356,45263.72821,221.747991,1.380878,20.302297,2020.816015,55.657218,29845.186671,12.22658,2.939657,2.335689,1.105171,4878321.0,68.058839,52.268155,56.995731,62.160484,51.232742,64.824322,49.622605,45.687803,52.20572,58.43451,42.974686,55.296321,47.351876,43.028008,52.709312,58.251404,64.69508,64.866871,63.383,61.588059,63.951445,56.812672,65.02312,63.068586,65.128474,47.101505,55.714289,46.6177,50.047078,53.187567,48.579993,57.837807,45.851456,47.759511,45.764783,16.509979,16.260136,16.129902,16.276951,16.615517,38.7503
std,35215.749284,2.838621,0.0,7.018104,6.255569,6184358.0,21637.414,4.679389,6.764179,6.999181,53516.528046,467.804515,0.750647,17.054347,2.902942,48.050844,47980.43048,6.911397,0.664775,0.754452,0.381699,12717950.0,11.063818,13.920686,10.427677,10.220788,16.580676,9.755666,17.985783,19.281183,17.146296,14.651313,17.4494,18.614112,18.059794,17.196317,15.172114,16.576721,14.789598,14.543439,14.725512,9.144995,14.082868,15.290486,11.898887,15.875722,12.615282,19.051298,17.176139,20.414351,19.23163,14.193916,15.660671,12.305995,20.452502,21.309061,20.899683,17.661659,16.846583,16.499513,17.009393,17.971201,10.578237
min,2.0,15.0,2.0,40.0,40.0,1000.0,500.0,16.0,154.0,49.0,1.0,1.0,1.0,1.0,2014.0,1.0,974.0,1.0,1.0,1.0,1.0,9000.0,21.0,14.0,20.0,22.0,14.0,27.0,5.0,2.0,4.0,7.0,3.0,2.0,4.0,3.0,5.0,5.0,11.0,11.0,11.0,20.0,10.0,2.0,13.0,10.0,12.0,3.0,2.0,3.0,2.0,3.0,5.0,3.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,12.0
25%,200759.0,17.0,2.0,61.0,66.0,325000.0,2000.0,21.0,176.0,70.0,450.0,19.0,1.0,8.0,2019.0,21.0,1343.0,6.0,3.0,2.0,1.0,633250.0,62.0,42.0,50.0,56.0,36.0,58.0,38.0,30.0,44.0,53.0,30.0,48.0,35.0,31.0,43.0,54.0,57.0,57.0,55.0,55.0,56.0,47.0,58.0,56.0,58.0,32.0,44.0,26.0,39.0,44.0,39.0,50.0,26.0,27.0,25.0,8.0,8.0,8.0,8.0,8.0,30.0
50%,222734.0,20.0,2.0,66.0,71.0,750000.0,4000.0,25.0,181.0,75.0,1891.0,56.0,1.0,17.0,2021.0,45.0,1365.0,12.0,3.0,2.0,1.0,1400000.0,69.0,54.0,58.0,63.0,56.0,66.0,54.0,49.0,55.0,62.0,44.0,61.0,49.0,42.0,56.0,63.0,67.0,67.0,66.0,62.0,66.0,59.0,66.0,66.0,66.0,51.0,58.0,52.0,55.0,55.0,50.0,59.0,51.0,55.0,52.0,11.0,11.0,11.0,11.0,11.0,40.0
75%,239858.0,22.0,2.0,70.0,75.0,1800000.0,10000.0,28.0,186.0,80.0,110912.0,308.0,2.0,27.0,2023.0,56.0,105035.0,18.0,3.0,3.0,1.0,3600000.0,76.0,63.0,64.0,69.0,64.0,72.0,63.0,62.0,64.0,68.0,56.0,68.0,61.0,56.0,64.0,69.0,75.0,75.0,74.0,68.0,74.0,68.0,73.0,74.0,74.0,62.0,69.0,64.0,64.0,64.0,60.0,66.0,63.0,66.0,64.0,14.0,14.0,14.0,14.0,14.0,46.0
max,278145.0,24.0,2.0,94.0,95.0,194000000.0,575000.0,54.0,208.0,110.0,131389.0,2149.0,5.0,99.0,2032.0,219.0,111527.0,97.0,5.0,5.0,5.0,373500000.0,97.0,94.0,94.0,96.0,91.0,92.0,95.0,96.0,95.0,95.0,93.0,97.0,94.0,95.0,95.0,96.0,97.0,97.0,96.0,96.0,97.0,96.0,97.0,97.0,98.0,94.0,96.0,93.0,96.0,96.0,96.0,96.0,94.0,94.0,95.0,91.0,92.0,95.0,92.0,94.0,68.0


In [22]:
# Check for missing values
missing_values = maleplayersdf.isna().sum()

# Filter and display columns with missing values greater than 0
missing_columns = missing_values[missing_values > 0]
display(missing_columns)

value_eur                           2153
wage_eur                            1848
club_team_id                        1865
club_name                           1865
league_id                           1865
league_name                         1865
league_level                        2250
club_position                       1865
club_jersey_number                  1865
club_loaned_from                  169298
club_joined_date                   12588
club_contract_valid_until_year      1865
nation_team_id                    169923
nation_position                   169923
nation_jersey_number              169923
release_clause_eur                 59299
player_tags                       166103
player_traits                      98216
pace                               20024
shooting                           20024
passing                            20024
dribbling                          20024
defending                          20024
physic                             20024
mentality_compos

What columns are missing values (If any)? Do you think you should remove the rows of data at this time in the exploration? Why or why not?

Out of the 109 columns, more than 20 have missing values. I am going to remove the rows that have an empty pace, shooting, passing, dribbling, defending, or physic value, since those six ratings are the main data points on FIFA player cards. Without them, there is not much to learn from the row.
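
Before dropping these rows, it may help to check who they are. In the `describe()` output, `goalkeeping_speed` has 20,024 non-null values, the same number as the rows missing `pace`, which suggests (but does not prove) that these rows are goalkeepers. The sketch below shows one way to check this on a small made-up frame; the `toy` data and player names are hypothetical, and the same lines could be pointed at `maleplayersdf`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the player table
toy = pd.DataFrame(
    {
        "short_name": ["Player A", "Player B", "Player C", "Player D"],
        "player_positions": ["ST, LW", "GK", "CM", "GK"],
        "pace": [97.0, np.nan, 72.0, np.nan],
        "goalkeeping_speed": [np.nan, 45.0, np.nan, 38.0],
    }
)

# Rows where the outfield rating is missing
missing_pace = toy["pace"].isna()

# Which positions do those rows belong to?
positions_missing_pace = toy.loc[missing_pace, "player_positions"].value_counts()

# Do all of them have a goalkeeper-only rating instead?
has_gk_speed = toy.loc[missing_pace, "goalkeeping_speed"].notna().all()

# Share of missing values per column, in percent
missing_share = toy.isna().mean().mul(100).round(1)

print(positions_missing_pace)
print(f"All rows missing pace have goalkeeping_speed: {has_gk_speed}")
print(missing_share)
```

If the check confirms the pattern in the real data, dropping these rows removes every goalkeeper, which is worth recording as a deliberate decision.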

If you chose to remove rows with specific missing values:

In [68]:
# Specify the columns I mentioned above for the code to look through
columns_to_check = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

# Create a new df that has no null values in the columns listed above
# .copy() makes it an independent DataFrame so later column assignments are safe
cleandata = maleplayersdf.dropna(subset=columns_to_check).copy()

# Check if they were dropped, I had 180021 rows initially
new_row_count = cleandata.shape[0]
print(f"New Row Count: {new_row_count}")

New Row Count: 159997
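
Since the assignment asks us to record everything that is omitted, one option is to keep the dropped rows instead of only counting what is left. This is a minimal sketch on a hypothetical `toy` frame; the output filename in the comment is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the player table
toy = pd.DataFrame(
    {
        "player_id": [1, 2, 3],
        "pace": [80.0, np.nan, 65.0],
        "shooting": [70.0, np.nan, 60.0],
    }
)
columns_to_check = ["pace", "shooting"]

# True for rows that dropna(subset=columns_to_check) would remove
dropped_mask = toy[columns_to_check].isna().any(axis=1)

# Keep both halves so the omitted records stay documented
dropped_rows = toy[dropped_mask]
kept_rows = toy[~dropped_mask].copy()

# The two parts should always add up to the original row count
print(f"Kept: {len(kept_rows)}, dropped: {len(dropped_rows)}, total: {len(toy)}")

# Optionally save the omitted rows (hypothetical filename):
# dropped_rows.to_csv("dropped_rows.csv", index=False)
```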


What about duplicates?

In [42]:
# Find all duplicate rows
duplicate_rows = cleandata[cleandata.duplicated(keep=False)]
print("Duplicate Rows:")
print(duplicate_rows)

Duplicate Rows:
Empty DataFrame
Columns: [player_id, player_url, fifa_version, fifa_update, update_as_of, short_name, long_name, player_positions, overall, potential, value_eur, wage_eur, age, dob, height_cm, weight_kg, club_team_id, club_name, league_id, league_name, league_level, club_position, club_jersey_number, club_loaned_from, club_joined_date, club_contract_valid_until_year, nationality_id, nationality_name, nation_team_id, nation_position, nation_jersey_number, preferred_foot, weak_foot, skill_moves, international_reputation, work_rate, body_type, real_face, release_clause_eur, player_tags, player_traits, pace, shooting, passing, dribbling, defending, physic, attacking_crossing, attacking_finishing, attacking_heading_accuracy, attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, skill_ball_control, movement_acceleration, movement_sprint_speed, movement_agility, movement_reactions, movement_balance, power_shot_power, p

In [None]:
# Remove duplicates

# The dataset has no duplicates to remove
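
Note that `duplicated()` without arguments only flags rows that match in every column. Because `fifa_version` ranges from 15 to 24, the same `player_id` can legitimately appear once per version (for example, 20801 shows up at two different row indexes in the outlier output further down). A stricter check looks for repeats of an identifying key. The sketch below assumes that `player_id`, `fifa_version`, and `fifa_update` together identify a row; that key, and the `toy` data, are assumptions.

```python
import pandas as pd

# Hypothetical rows: the same player in two versions, plus one repeated key
toy = pd.DataFrame(
    {
        "player_id": [20801, 20801, 20801, 158023],
        "fifa_version": [24, 23, 23, 24],
        "fifa_update": [2, 2, 2, 2],
        "overall": [80, 81, 82, 90],
    }
)

# Assumed identifying key for one player snapshot
key_columns = ["player_id", "fifa_version", "fifa_update"]

# Rows identical in every column (what the cell above checks)
full_row_dupes = toy[toy.duplicated(keep=False)]

# Rows that share the same key even if other columns differ
key_dupes = toy[toy.duplicated(subset=key_columns, keep=False)]

print(f"Full-row duplicates: {len(full_row_dupes)}")
print(f"Key duplicates: {len(key_dupes)}")
```

Key duplicates with conflicting values need a decision about which record to keep, so they are worth reviewing by hand before loading into SQL.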

Let's revisit the structure and look at the data types for each column. This will be important for SQL.

In [48]:
# The default output does not show every column's data type because there are too many,
# so I am putting the data types into a table instead
data_types = pd.DataFrame(cleandata.dtypes, columns=['Data Type'])

# Transpose the dataframe, since the vertical layout still does not show all columns
data_types_transposed = data_types.transpose()

# Print
print("Data Types:")
display(data_types_transposed)

Data Types:


Unnamed: 0,player_id,player_url,fifa_version,fifa_update,update_as_of,short_name,long_name,player_positions,overall,potential,value_eur,wage_eur,age,dob,height_cm,weight_kg,club_team_id,club_name,league_id,league_name,league_level,club_position,club_jersey_number,club_loaned_from,club_joined_date,club_contract_valid_until_year,nationality_id,nationality_name,nation_team_id,nation_position,nation_jersey_number,preferred_foot,weak_foot,skill_moves,international_reputation,work_rate,body_type,real_face,release_clause_eur,player_tags,player_traits,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,gk
Data Type,int64,object,float64,float64,datetime64[ns],object,object,object,int64,int64,float64,float64,int64,datetime64[ns],int64,int64,int64,object,int64,object,int64,object,int64,object,datetime64[ns],object,int64,object,int64,object,int64,object,int64,int64,int64,object,object,object,float64,object,object,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object


In [124]:
# List columns to convert into dates then convert it
date_columns = ['update_as_of', 'dob', 'club_joined_date']

for col in date_columns:
    cleandata[col] = pd.to_datetime(cleandata[col])

# Cast this column to string first; it is converted back to int in the loop below
cleandata['club_contract_valid_until_year'] = cleandata['club_contract_valid_until_year'].astype(str)

# List columns to convert into ints then convert it
int_columns = ['fifa_version','release_clause_eur', 'club_contract_valid_until_year', 'value_eur', 'wage_eur','club_jersey_number', 'league_id', 'club_team_id', 'league_level',
               'nation_team_id', 'nation_jersey_number', 'pace', 'shooting',
               'passing', 'dribbling', 'defending', 'physic', 'mentality_composure']

for col in int_columns:
    # errors='coerce' turns values that cannot be parsed into NaN;
    # fillna(0) then replaces every missing value with 0 so the column can be cast to int
    cleandata[col] = pd.to_numeric(cleandata[col], errors='coerce').fillna(0).astype(int)

# Print the updated data types for each column
new_data_types = pd.DataFrame(cleandata.dtypes, columns=['Data Type'])

# Transposing the dataframe since its still not showing all columns as it is vertical
new_data_types_transposed = new_data_types.transpose()

# Print
print("Data Types:")
display(new_data_types_transposed)

Data Types:


Unnamed: 0,player_id,fifa_version,short_name,long_name,player_positions,overall,potential,value_eur,value_dol,wage_eur,wage_dol,age,dob,height_cm,height_in,weight_kg,weight_lbs,club_team_id,club_name,league_id,league_name,league_level,club_position,club_jersey_number,club_loaned_from,club_joined_date,club_contract_valid_until_year,nationality_id,nationality_name,nation_team_id,nation_position,nation_jersey_number,preferred_foot,weak_foot,skill_moves,work_rate,body_type,release_clause_eur,release_clause_dol,player_tags,player_traits,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,gk
Data Type,int64,int64,object,object,object,int64,int64,int64,int64,int64,int64,int64,datetime64[ns],int64,float64,int64,float64,int64,object,int64,object,int64,object,int64,object,datetime64[ns],int64,int64,object,int64,object,int64,object,int64,int64,object,object,int64,int64,object,object,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object
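
One side effect of `fillna(0)` is that a missing value becomes indistinguishable from a real zero; for example, a player without a national team ends up with `nation_team_id` 0. An alternative, sketched below on hypothetical data, is pandas' nullable `Int64` dtype, which keeps missing values as `<NA>` while still storing whole numbers. Whether missing values should stay missing depends on how the SQL table will be defined.

```python
import pandas as pd

# Hypothetical columns with missing and unparseable entries
toy = pd.DataFrame(
    {
        "nation_team_id": ["1335", None, "1352"],
        "release_clause_eur": [349400000.0, None, "n/a"],
    }
)

for col in toy.columns:
    # Unparseable values become NaN, then the nullable integer dtype keeps them as <NA>
    toy[col] = pd.to_numeric(toy[col], errors="coerce").astype("Int64")

print(toy.dtypes)
print(toy)
```

Note the capital "I" in `"Int64"`: lowercase `"int64"` is the regular NumPy dtype and cannot hold missing values.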


Changing text characters in your data. Make all column names lowercase. Lowercase is easier to read in SQL when we get to that point.

In [71]:
# Convert all column names to lowercase
cleandata.columns = [col.lower() for col in cleandata.columns]

# Check if it worked
column_names = list(cleandata.columns)
print(column_names)

['player_id', 'player_url', 'fifa_version', 'fifa_update', 'update_as_of', 'short_name', 'long_name', 'player_positions', 'overall', 'potential', 'value_eur', 'wage_eur', 'age', 'dob', 'height_cm', 'weight_kg', 'club_team_id', 'club_name', 'league_id', 'league_name', 'league_level', 'club_position', 'club_jersey_number', 'club_loaned_from', 'club_joined_date', 'club_contract_valid_until_year', 'nationality_id', 'nationality_name', 'nation_team_id', 'nation_position', 'nation_jersey_number', 'preferred_foot', 'weak_foot', 'skill_moves', 'international_reputation', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags', 'player_traits', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed'

Assignment #1 asked you to create a graph and check for outliers. Are there any outliers in your columns? How can we check for outliers?

In [91]:
# Select only numeric columns
numeric_cols = cleandata.select_dtypes(include=['float64', 'int64'])

# Find mean and standard deviation of each column
mean = numeric_cols.mean()
std = numeric_cols.std()

# Print the mean in a df
print("\nMean")
meandf = pd.DataFrame(mean, columns=['Mean'])
meandf = meandf.transpose()
display(meandf)

# Print the std in a df
print("\nStandard Deviation")
stddf = pd.DataFrame(std, columns=['std'])
stddf = stddf.transpose()
display(stddf)

# Anything more than 3 standard deviations from the mean is treated as an outlier
# Define the z-score threshold for outliers
z_score_threshold = 3

# Dictionary mapping each column name to its outliers
outliers_dfs = {}

# Iterate over each numeric column
for col in numeric_cols.columns:
    # Calculate the absolute z-score for each data point:
    # subtract the mean from the value, then divide by the standard deviation
    z_scores = ((numeric_cols[col] - mean[col]) / std[col]).abs()

    # Identify outliers and store them so they can be reviewed later
    outliers = numeric_cols[col][z_scores > z_score_threshold]
    outliers_dfs[col] = outliers

    # Print the outliers for each column
    print(f"Outliers in column '{col}':")
    print(outliers)


Mean


Unnamed: 0,player_id,fifa_version,fifa_update,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_id,league_level,club_jersey_number,nationality_id,nation_team_id,nation_jersey_number,weak_foot,skill_moves,international_reputation,release_clause_eur,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed
Mean,218061.978912,19.621062,2.0,65.890042,70.956268,2465144.0,10978.172792,25.017832,180.410795,74.403895,44767.489415,219.332919,1.366451,19.98325,55.962418,1638.118602,0.67625,2.996175,2.502853,1.106708,5079579.0,68.058839,52.268155,56.995731,62.160484,51.232742,64.824322,53.946086,49.78354,56.841241,62.330819,46.684288,60.380063,51.3329,46.513241,56.026657,63.022157,67.965418,68.114702,66.29288,61.938299,66.53941,59.501653,65.937905,67.129827,65.684713,51.302693,59.384551,50.275499,54.771702,55.344375,52.131127,49.019607,49.742176,51.860341,49.626174,10.42177,10.465859,10.468078,10.442696,10.423745,



Standard Deviation


Unnamed: 0,player_id,fifa_version,fifa_update,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_id,league_level,club_jersey_number,nationality_id,nation_team_id,nation_jersey_number,weak_foot,skill_moves,international_reputation,release_clause_eur,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed
std,34158.197674,2.839239,0.0,6.912752,6.206342,6277250.0,22066.1222,4.555788,6.461772,6.639845,53431.125886,465.548093,0.7637,16.705307,48.088474,13135.577596,3.207441,0.642219,0.623866,0.383534,12925960.0,11.063818,13.920686,10.427677,10.220788,16.580676,9.755666,13.897311,16.240831,11.606409,9.88218,14.66,12.411786,14.867467,14.83217,12.304423,10.009345,11.599362,11.324926,12.313158,8.921312,12.137752,13.27336,11.654584,11.290892,12.676394,15.676766,14.265517,18.543649,14.538365,12.816575,12.499706,25.01723,18.166305,18.899758,18.838614,3.198493,3.165451,3.232582,3.17674,3.201771,


Outliers in column 'player_id':
49         20801
3929       18115
4792       25798
10242      23823
18356      20801
           ...  
178380     18756
178381     23467
178382    100557
178764     19699
179044     36140
Name: player_id, Length: 1905, dtype: int64
Outliers in column 'fifa_version':
Series([], Name: fifa_version, dtype: float64)
Outliers in column 'fifa_update':
Series([], Name: fifa_update, dtype: float64)
Outliers in column 'overall':
0         91
1         91
2         91
3         90
4         90
          ..
180016    41
180017    41
180018    40
180019    40
180020    40
Name: overall, Length: 388, dtype: int64
Outliers in column 'potential':
0         94
1         94
2         91
3         90
4         90
          ..
180008    50
180017    50
180018    50
180019    49
180020    40
Name: potential, Length: 449, dtype: int64
Outliers in column 'value_eur':
0         181500000.0
1         185000000.0
2         103000000.0
3          41000000.0
4          51000000.0
 

Outliers in column 'skill_ball_control':
12        94
6045      32
11710     31
12046     29
12781     30
          ..
180004    32
180005    27
180016    26
180017    30
180018    32
Name: skill_ball_control, Length: 1365, dtype: int64
Outliers in column 'movement_acceleration':
325       32
571       33
749       32
750       31
937       33
          ..
179046    33
179053    33
179228    32
179275    32
179620    33
Name: movement_acceleration, Length: 1648, dtype: int64
Outliers in column 'movement_sprint_speed':
314       33
749       32
750       31
951       32
968       33
          ..
179620    33
179678    32
179732    34
179761    29
179781    33
Name: movement_sprint_speed, Length: 2300, dtype: int64
Outliers in column 'movement_agility':
2107      29
3190      28
3243      29
4858      28
8055      29
          ..
176247    29
176918    29
178745    28
178933    29
179117    23
Name: movement_agility, Length: 269, dtype: int64
Outliers in column 'movement_reactions':
0   
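
Two notes on reading these results. ID columns such as `player_id` are labels rather than measurements, so values flagged there are not meaningful outliers. Also, the 3-standard-deviation rule works best for roughly symmetric columns; `value_eur` is strongly skewed (in the summary above its mean is about 2.38 million while its median is 750,000). A common alternative for skewed columns is the interquartile range (IQR) rule, sketched below. The helper name `iqr_outliers`, the multiplier of 1.5, and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical, heavily skewed market values in euros
toy_values = pd.Series(
    [325_000, 500_000, 750_000, 1_200_000, 1_800_000, 181_500_000],
    name="value_eur",
)

flagged = iqr_outliers(toy_values)
print(flagged)
```

Either rule only flags candidates. A very high market value for a top player is a real data point, so flagged rows should be reviewed rather than removed automatically.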

<b>To create additional steps for data cleaning in Jupyter notebook: </b><br>
Hit the plus button in the top left corner to add a row of code. <br>
To change from code to text or headers, select from the drop down menu above. <br>
Use "< b r >" (No spaces or quotes) to skip a line in markdown and other HTML text font options.

Additional step #1:

In [119]:
# Creating new columns that are easier to interpret:
# converting euros, cm, and kg to dollars, inches, and pounds

# Initialize conversion factors
eur_to_usd = 1.1  # fixed, approximate exchange rate
cm_to_inch = 0.393701
kg_to_lb = 2.20462

# Creating new columns with conversions
    # Euro to Dollar
cleandata['value_dol'] = (cleandata['value_eur'] * eur_to_usd).fillna(0).astype(int)
cleandata['wage_dol'] = (cleandata['wage_eur'] * eur_to_usd).fillna(0).astype(int)
cleandata['release_clause_dol'] = (cleandata['release_clause_eur'] * eur_to_usd).fillna(0).astype(int)

    # cm to in
cleandata['height_in'] = (cleandata['height_cm'] * cm_to_inch).round(2)
    # kg to lbs
cleandata['weight_lbs'] = (cleandata['weight_kg'] * kg_to_lb).round(2)
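
A quick way to confirm the unit conversions behave as intended is to spot-check a couple of known values. This is a minimal sketch on a hypothetical two-row frame, using the same conversion factors as above.

```python
import pandas as pd

# Same factors as in the cell above
CM_TO_INCH = 0.393701
KG_TO_LB = 2.20462

# Hypothetical two-row sample of heights and weights
toy = pd.DataFrame({"height_cm": [182, 195], "weight_kg": [75, 94]})

toy["height_in"] = (toy["height_cm"] * CM_TO_INCH).round(2)
toy["weight_lbs"] = (toy["weight_kg"] * KG_TO_LB).round(2)

# 182 cm should come out near 71.65 in, and 75 kg near 165.35 lbs
print(toy)
```

The euro-to-dollar columns cannot be checked the same way, because the factor of 1.1 is a single fixed rate applied to every FIFA version rather than the rate at the time of each update.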

Additional step #2:

In [120]:
# Change the column order so the new columns are next to the original columns
    # Define the new order
    
new_column_order = ['player_id', 'player_url', 'fifa_version', 'fifa_update', 
                    'update_as_of', 'short_name', 'long_name', 'player_positions', 
                    'overall', 'potential', 'value_eur', 'value_dol', 'wage_eur', 'wage_dol',
                    'age', 'dob', 'height_cm', 'height_in', 'weight_kg', 'weight_lbs', 
                    'club_team_id', 'club_name', 'league_id', 'league_name', 'league_level', 
                    'club_position', 'club_jersey_number', 'club_loaned_from', 'club_joined_date',
                    'club_contract_valid_until_year', 'nationality_id', 'nationality_name', 
                    'nation_team_id', 'nation_position', 'nation_jersey_number', 'preferred_foot',
                    'weak_foot', 'skill_moves', 'international_reputation', 'work_rate', 'body_type', 
                    'real_face', 'release_clause_eur', 'release_clause_dol',
                    'player_tags', 'player_traits', 'pace', 'shooting', 'passing', 'dribbling', 
                    'defending', 'physic', 'attacking_crossing','attacking_finishing', 
                    'attacking_heading_accuracy', 'attacking_short_passing', 
                    'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 
                    'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 
                    'movement_sprint_speed', 'movement_agility', 'movement_reactions', 
                    'movement_balance', 'power_shot_power', 'power_jumping', 'power_stamina', 
                    'power_strength', 'power_long_shots', 'mentality_aggression', 
                    'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 
                    'mentality_penalties', 'mentality_composure', 'defending_marking_awareness',
                    'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 
                    'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 
                    'goalkeeping_reflexes', 'goalkeeping_speed', 'ls', 'st', 'rs', 'lw', 'lf', 
                    'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 
                    'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', 'gk']

# Apply the new order
cleandata = cleandata[new_column_order]

# Check if it worked
display(cleandata.head())

Unnamed: 0,player_id,player_url,fifa_version,fifa_update,update_as_of,short_name,long_name,player_positions,overall,potential,value_eur,value_dol,wage_eur,wage_dol,age,dob,height_cm,height_in,weight_kg,weight_lbs,club_team_id,club_name,league_id,league_name,league_level,club_position,club_jersey_number,club_loaned_from,club_joined_date,club_contract_valid_until_year,nationality_id,nationality_name,nation_team_id,nation_position,nation_jersey_number,preferred_foot,weak_foot,skill_moves,international_reputation,work_rate,body_type,real_face,release_clause_eur,release_clause_dol,player_tags,player_traits,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,gk
0,231747,/player/231747/kylian-mbappe/240002,24.0,2.0,2023-09-22,K. Mbappé,Kylian Mbappé Lottin,"ST, LW",91,94,181500000,199650000,230000,253000,24,1998-12-20,182,71.65,75,165.35,73,Paris Saint Germain,16,Ligue 1,1,LW,7,,2018-07-01,2024,18,France,1335,LW,10,Right,4,5,5,High/Low,Unique,Yes,349400000,384340000,"#Speedster, #Dribbler, #Acrobat, #Clinical fin...","Quick Step +, Rapid, Flair, Trivela",97,90,80,92,36,78,78,94,73,86,84,93,80,69,71,92,97,97,93,93,82,90,88,88,77,83,64,38,93,83,84,88,26,34,32,13,5,7,11,6,,90+3,90+3,90+3,91,91,91,91,91,89+3,89+3,89+3,89+3,81+3,81+3,81+3,89+3,68+3,63+3,63+3,63+3,68+3,63+3,54+3,54+3,54+3,63+3,18+3
1,239085,/player/239085/erling-haaland/240002,24.0,2.0,2023-09-22,E. Haaland,Erling Braut Haaland,ST,91,94,185000000,203500000,340000,374000,22,2000-07-21,195,76.77,94,207.23,10,Manchester City,13,Premier League,1,ST,9,,2022-07-01,2027,36,Norway,1352,ST,9,Left,3,3,5,High/Medium,Unique,Yes,356100000,391710000,"#Aerial threat, #Distance shooter, #Strength, ...","Acrobatic +, Power Header, Quick Step",89,93,66,80,45,88,47,96,83,77,90,79,77,62,53,82,82,94,76,94,72,94,93,76,93,86,87,43,96,74,84,87,38,47,29,7,14,13,11,7,,90+3,90+3,90+3,82,86,86,86,82,82+3,82+3,82+3,79+3,74+3,74+3,74+3,79+3,62+3,63+3,63+3,63+3,62+3,60+3,62+3,62+3,62+3,60+3,19+3
2,192985,/player/192985/kevin-de-bruyne/240002,24.0,2.0,2023-09-22,K. De Bruyne,Kevin De Bruyne,"CM, CAM",91,91,103000000,113300000,350000,385000,32,1991-06-28,181,71.26,75,165.35,10,Manchester City,13,Premier League,1,SUB,17,,2015-08-30,2025,7,Belgium,1325,CAM,7,Right,5,4,5,High/Medium,Unique,Yes,190600000,209660000,"#Dribbler, #Playmaker, #Distance shooter, #Cro...","Pinged Pass +, Dead Ball, Incisive Pass, Long ...",72,88,94,87,65,78,95,85,55,94,83,86,92,83,94,92,72,72,74,92,78,92,72,88,74,92,75,66,88,95,83,88,66,70,53,15,13,5,10,13,,83+3,83+3,83+3,87,88,88,88,87,89+2,89+2,89+2,88+3,90+1,90+1,90+1,88+3,79+3,80+3,80+3,80+3,79+3,75+3,70+3,70+3,70+3,75+3,21+3
3,158023,/player/158023/lionel-messi/240002,24.0,2.0,2023-09-22,L. Messi,Lionel Andrés Messi Cuccittini,"CF, CAM",90,90,41000000,45100000,23000,25300,36,1987-06-24,169,66.54,67,147.71,112893,Inter Miami,39,Major League Soccer,1,RF,10,,2023-07-16,2025,52,Argentina,1369,RW,10,Left,4,4,5,Low/Low,Unique,Yes,61500000,67650000,"#Dribbler, #Playmaker, #FK Specialist, #Acroba...","Technical +, Finesse Shot, Dead Ball, Pinged P...",80,87,90,94,33,64,83,89,60,91,86,96,93,93,90,93,87,74,91,88,95,83,71,70,68,90,44,40,91,92,75,96,20,35,24,6,11,15,14,8,,85+3,85+3,85+3,90,89,89,89,90,91-1,91-1,91-1,89+1,85+3,85+3,85+3,89+1,64+3,63+3,63+3,63+3,64+3,59+3,49+3,49+3,49+3,59+3,19+3
4,165153,/player/165153/karim-benzema/240002,24.0,2.0,2023-09-22,K. Benzema,Karim Benzema,"CF, ST",90,90,51000000,56100000,95000,104500,35,1987-12-19,185,72.83,81,178.57,607,Al Ittihad,350,Pro League,1,RS,9,,2023-07-01,2026,18,France,0,,0,Right,4,4,5,Medium/Medium,Normal (170-185),Yes,81600000,89760000,"#Poacher, #Aerial threat, #Clinical finisher, ...","Finesse Shot +, Dead Ball, Pinged Pass, Tiki T...",79,88,83,87,39,78,75,91,90,89,88,87,82,73,76,91,78,79,77,92,72,87,85,82,82,81,63,39,92,90,85,90,43,24,18,13,11,5,5,7,,88+2,88+2,88+2,86,89,89,89,86,88+2,88+2,88+2,86+3,82+3,82+3,82+3,86+3,64+3,64+3,64+3,64+3,64+3,60+3,55+3,55+3,55+3,60+3,18+3


Additional step #3: drop the columns we do not need.

In [127]:
# Dropping unneeded columns
cleandata = cleandata.drop(columns=['player_url', 'update_as_of', 'fifa_update', 
                                    'international_reputation', 'real_face'])

# Check if it worked
display(cleandata.head())

Unnamed: 0,player_id,fifa_version,short_name,long_name,player_positions,overall,potential,value_eur,value_dol,wage_eur,wage_dol,age,dob,height_cm,height_in,weight_kg,weight_lbs,club_team_id,club_name,league_id,league_name,league_level,club_position,club_jersey_number,club_loaned_from,club_joined_date,club_contract_valid_until_year,nationality_id,nationality_name,nation_team_id,nation_position,nation_jersey_number,preferred_foot,weak_foot,skill_moves,work_rate,body_type,release_clause_eur,release_clause_dol,player_tags,player_traits,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,goalkeeping_speed,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,gk
0,231747,24,K. Mbappé,Kylian Mbappé Lottin,"ST, LW",91,94,181500000,199650000,230000,253000,24,1998-12-20,182,71.65,75,165.35,73,Paris Saint Germain,16,Ligue 1,1,LW,7,,2018-07-01,2024,18,France,1335,LW,10,Right,4,5,High/Low,Unique,349400000,384340000,"#Speedster, #Dribbler, #Acrobat, #Clinical fin...","Quick Step +, Rapid, Flair, Trivela",97,90,80,92,36,78,78,94,73,86,84,93,80,69,71,92,97,97,93,93,82,90,88,88,77,83,64,38,93,83,84,88,26,34,32,13,5,7,11,6,,90+3,90+3,90+3,91,91,91,91,91,89+3,89+3,89+3,89+3,81+3,81+3,81+3,89+3,68+3,63+3,63+3,63+3,68+3,63+3,54+3,54+3,54+3,63+3,18+3
1,239085,24,E. Haaland,Erling Braut Haaland,ST,91,94,185000000,203500000,340000,374000,22,2000-07-21,195,76.77,94,207.23,10,Manchester City,13,Premier League,1,ST,9,,2022-07-01,2027,36,Norway,1352,ST,9,Left,3,3,High/Medium,Unique,356100000,391710000,"#Aerial threat, #Distance shooter, #Strength, ...","Acrobatic +, Power Header, Quick Step",89,93,66,80,45,88,47,96,83,77,90,79,77,62,53,82,82,94,76,94,72,94,93,76,93,86,87,43,96,74,84,87,38,47,29,7,14,13,11,7,,90+3,90+3,90+3,82,86,86,86,82,82+3,82+3,82+3,79+3,74+3,74+3,74+3,79+3,62+3,63+3,63+3,63+3,62+3,60+3,62+3,62+3,62+3,60+3,19+3
2,192985,24,K. De Bruyne,Kevin De Bruyne,"CM, CAM",91,91,103000000,113300000,350000,385000,32,1991-06-28,181,71.26,75,165.35,10,Manchester City,13,Premier League,1,SUB,17,,2015-08-30,2025,7,Belgium,1325,CAM,7,Right,5,4,High/Medium,Unique,190600000,209660000,"#Dribbler, #Playmaker, #Distance shooter, #Cro...","Pinged Pass +, Dead Ball, Incisive Pass, Long ...",72,88,94,87,65,78,95,85,55,94,83,86,92,83,94,92,72,72,74,92,78,92,72,88,74,92,75,66,88,95,83,88,66,70,53,15,13,5,10,13,,83+3,83+3,83+3,87,88,88,88,87,89+2,89+2,89+2,88+3,90+1,90+1,90+1,88+3,79+3,80+3,80+3,80+3,79+3,75+3,70+3,70+3,70+3,75+3,21+3
3,158023,24,L. Messi,Lionel Andrés Messi Cuccittini,"CF, CAM",90,90,41000000,45100000,23000,25300,36,1987-06-24,169,66.54,67,147.71,112893,Inter Miami,39,Major League Soccer,1,RF,10,,2023-07-16,2025,52,Argentina,1369,RW,10,Left,4,4,Low/Low,Unique,61500000,67650000,"#Dribbler, #Playmaker, #FK Specialist, #Acroba...","Technical +, Finesse Shot, Dead Ball, Pinged P...",80,87,90,94,33,64,83,89,60,91,86,96,93,93,90,93,87,74,91,88,95,83,71,70,68,90,44,40,91,92,75,96,20,35,24,6,11,15,14,8,,85+3,85+3,85+3,90,89,89,89,90,91-1,91-1,91-1,89+1,85+3,85+3,85+3,89+1,64+3,63+3,63+3,63+3,64+3,59+3,49+3,49+3,49+3,59+3,19+3
4,165153,24,K. Benzema,Karim Benzema,"CF, ST",90,90,51000000,56100000,95000,104500,35,1987-12-19,185,72.83,81,178.57,607,Al Ittihad,350,Pro League,1,RS,9,,2023-07-01,2026,18,France,0,,0,Right,4,4,Medium/Medium,Normal (170-185),81600000,89760000,"#Poacher, #Aerial threat, #Clinical finisher, ...","Finesse Shot +, Dead Ball, Pinged Pass, Tiki T...",79,88,83,87,39,78,75,91,90,89,88,87,82,73,76,91,78,79,77,92,72,87,85,82,82,81,63,39,92,90,85,90,43,24,18,13,11,5,5,7,,88+2,88+2,88+2,86,89,89,89,86,88+2,88+2,88+2,86+3,82+3,82+3,82+3,86+3,64+3,64+3,64+3,64+3,64+3,60+3,55+3,55+3,55+3,60+3,18+3
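
One small caveat with the drop step: re-running that cell after the columns are already gone raises a `KeyError`. The sketch below shows one possible way to make the step safe to re-run and to confirm the result without scanning the wide output by eye. It uses a hypothetical two-row stand-in called `sample` (values copied from the first two rows above), not the real `cleandata`, and the names `to_drop`, `trimmed`, and `remaining` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the FIFA table (not the real cleandata)
sample = pd.DataFrame({
    "player_id": [231747, 239085],
    "short_name": ["K. Mbappé", "E. Haaland"],
    "player_url": ["/player/231747/kylian-mbappe/240002",
                   "/player/239085/erling-haaland/240002"],
    "real_face": ["Yes", "Yes"],
})

# Same column names as in the drop step above
to_drop = ["player_url", "update_as_of", "fifa_update",
           "international_reputation", "real_face"]

# errors="ignore" skips names that are not present, so the cell can be re-run
trimmed = sample.drop(columns=to_drop, errors="ignore")

# List any dropped names that are still present (should be empty)
remaining = [c for c in to_drop if c in trimmed.columns]
print(remaining)              # []
print(list(trimmed.columns))  # ['player_id', 'short_name']
```

An empty `remaining` list means every unwanted column is gone. The trade-off of `errors="ignore"` is that a misspelled column name is silently skipped, which is why the explicit check afterwards is useful.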


Let's save our new CLEAN data :)

In [129]:
# Save the newly cleaned dataset as a NEW file:
cleandata.to_csv('male_players_cleaned.csv', index=False)
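
As an optional sanity check, a saved file can be read back in and compared with the data frame it came from. The sketch below is only an illustration on a hypothetical two-row `sample` written to a temporary folder (the file names are made up), so it does not touch `male_players_cleaned.csv`. It also shows why `index=False` matters: without it, pandas writes the row index as an extra unnamed column, which comes back as `Unnamed: 0` on the next read. The `Unnamed: 0` column in the output above is probably a leftover of that kind from an earlier save.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for cleandata
sample = pd.DataFrame({"player_id": [231747, 239085], "overall": [91, 91]})

with tempfile.TemporaryDirectory() as tmp:
    # Save without the index, then read the file back
    path = os.path.join(tmp, "sample_cleaned.csv")
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # Save with the default index=True for comparison
    path_idx = os.path.join(tmp, "sample_with_index.csv")
    sample.to_csv(path_idx)
    reloaded_idx = pd.read_csv(path_idx)

# Same number of rows and columns after the round trip
print(reloaded.shape == sample.shape)  # True

# The index was written as an extra, unnamed first column
print(list(reloaded_idx.columns))      # ['Unnamed: 0', 'player_id', 'overall']
```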