**Hello everyone,**

Forked from: https://www.kaggle.com/smzimran/geographical-overview-world-football
* Adds some target columns.

* Uses two additional Kaggle datasets to borrow the latitude and longitude coordinates of the cities where matches were played.

In [16]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

results = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')
results.head(10)

I have extracted the day, month, and year values from the date, as they are more useful than the raw date string. Additionally, I have added a new column that records the winner of each game, or 'Draw' if the scores were level.

Although I will not be using these new features in this notebook, anyone is welcome to borrow the code to use for further analysis.
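
The date-splitting step itself does not appear in the cells of this notebook. A minimal sketch of how it could be done, assuming the `date` column holds `YYYY-MM-DD` strings (the toy frame `sample` and the column names `year`, `month`, `day` are illustrative, not taken from the original code):

```python
import pandas as pd

# Toy frame standing in for `results`; only the 'date' column matters here.
sample = pd.DataFrame({"date": ["1872-11-30", "1873-03-08", "1950-07-16"]})

# Parse the strings once, then read each part through the .dt accessor.
parsed = pd.to_datetime(sample["date"], format="%Y-%m-%d")
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day

print(sample)
```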

In [17]:
def find_winners(df):
    winners = []
    for i, row in df.iterrows():
        if row['home_score'] > row['away_score']:
            winners.append(row['home_team'])
        elif row['home_score'] < row['away_score']:
            winners.append(row['away_team'])
        else:
            winners.append('Draw')
    return winners
results['winner'] = find_winners(results)
results.head()
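
The row-by-row loop above is easy to read, but `iterrows` can be slow on tens of thousands of matches. As an alternative sketch (not the original author's method), the same three-way rule can be expressed with `numpy.select`. The toy frame `sample` is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy matches; scores are illustrative only.
sample = pd.DataFrame({
    "home_team": ["Scotland", "England", "Wales"],
    "away_team": ["England", "Scotland", "Ireland"],
    "home_score": [0, 4, 1],
    "away_score": [0, 2, 3],
})

# Conditions are checked in order; rows matching neither get the default.
conditions = [
    sample["home_score"] > sample["away_score"],  # home side won
    sample["home_score"] < sample["away_score"],  # away side won
]
choices = [sample["home_team"], sample["away_team"]]
sample["winner"] = np.select(conditions, choices, default="Draw")

print(sample[["home_team", "away_team", "winner"]])
```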

### More "leaky" possible target columns
* Note that these columns can serve as inputs for aggregate feature engineering, but they must be excluded from the primary feature set, since they encode the target (who won).
    * A measure of how extreme the score difference in a match was
    * A home/away/draw win indicator (regardless of country).
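
To make the note above concrete, here is a hedged sketch of one way to use a leaky column for an aggregate feature while keeping it out of the model inputs. It assumes the rows are sorted by date. The feature name `home_prev_mean_diff` and the list `leaky_columns` are hypothetical, and treating the raw scores as leaky is my own assumption:

```python
import pandas as pd

# Toy matches in chronological order; values are made up for illustration.
sample = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"],
    "home_team": ["Scotland", "England", "Scotland", "England"],
    "away_team": ["England", "Wales", "Wales", "Scotland"],
    "home_score": [0, 2, 3, 0],
    "away_score": [0, 0, 0, 1],
})
sample["home_subtract_away_score_diff"] = sample["home_score"] - sample["away_score"]

# Aggregate feature: mean score difference over the home team's EARLIER home
# matches only. shift() drops the current row from its own average, so the
# feature never sees the outcome it will be used to predict.
sample["home_prev_mean_diff"] = (
    sample.groupby("home_team")["home_subtract_away_score_diff"]
    .transform(lambda s: s.shift().expanding().mean())
)

# Columns that reveal the outcome of the current match (assumed list).
leaky_columns = ["home_score", "away_score", "home_subtract_away_score_diff"]
target = sample["home_subtract_away_score_diff"]
features = sample.drop(columns=leaky_columns)

print(features)
```

The first home match of each team has no history, so its aggregate value is missing and would need imputing or dropping before modelling.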

In [18]:
def find_team_win(df):
    win = []
    for i, row in df.iterrows():
        if row['home_score'] > row['away_score']:
            win.append('home_team')
        elif row['home_score'] < row['away_score']:
            win.append('away_team')
        else:
            win.append('Draw')
    return win
results['home_away_win'] = find_team_win(results)

In [19]:
results["home_subtract_away_score_diff"] = results["home_score"]-results["away_score"]

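The dataset did not indicate which matches were played at a neutral venue, so I add a neutral-venue indicator as a new column. A match is flagged as neutral when the host country name does not appear in the home team name; matches with 'USA' as the home team are never flagged.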

In [20]:
neutral = []
for index, row in results.iterrows():
    neutral.append((row['country'] not in  row['home_team']) and (row['home_team'] != 'USA'))
    
results['neutral'] = neutral
results['neutral'] = results['neutral'].astype(int)

results.head()
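
As a small worked example of the rule in the loop above, the sketch below applies the same test to three made-up rows without `iterrows`. The frame `sample` is illustrative; the logic (substring test plus the 'USA' exception) is copied from the cell above:

```python
import pandas as pd

# Toy rows: a true home game, a 'USA' home game, and a neutral-venue game.
sample = pd.DataFrame({
    "home_team": ["Scotland", "USA", "Brazil"],
    "country": ["Scotland", "United States", "Uruguay"],
})

# Neutral when the host country name is not contained in the home team name,
# except that a home team named 'USA' is never marked neutral.
is_neutral = [
    (country not in home) and (home != "USA")
    for home, country in zip(sample["home_team"], sample["country"])
]
sample["neutral"] = pd.Series(is_neutral, index=sample.index).astype(int)

print(sample)
```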

In [21]:
results.isnull().sum()

#### Data from the World Cities dataset to find latitude and longitude values.

In [22]:
cities = pd.read_csv('../input/world-cities-database/worldcitiespop.csv', encoding='latin-1', dtype={
    'Country': str,
    'City': str,
    'AccentCity': str,
    'Region': str
})
cities = cities.dropna()
cities = cities[['Country', 'AccentCity', 'Latitude', 'Longitude']]
cities.head()

In [23]:
countries = pd.read_csv('../input/world-countries-and-continents-details/countries and continents.csv')
countries = countries.rename(columns = {'official_name_en': 'Name', 'ISO3166-1-Alpha-2': 'Code'})
countries = countries[['Name', 'Code']]
countries['Code'] = countries['Code'].str.lower()

In [24]:
coords = results[['city', 'country']]
df = coords.merge(cities, how='inner', left_on='city', right_on='AccentCity')
new_df = df.merge(countries, how='left', left_on='Country', right_on='Code')
city_coords = new_df[['city', 'Latitude', 'Longitude']]
city_coords = city_coords.drop_duplicates('city')
city_coords = city_coords.rename(columns={'Latitude':'latitude', 'Longitude': 'longitude'})
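
The cell above joins on the city name only, so when the same name exists in several countries, `drop_duplicates('city')` keeps whichever match happens to come first. A hedged sketch of a country-aware variant follows. It assumes the `country` values in the results match the `Name` column of the countries table, which will not hold for every row. The toy frames and approximate coordinates are made up for illustration:

```python
import pandas as pd

# Two venues that share a city name but lie in different countries.
venues = pd.DataFrame({
    "city": ["Cordoba", "Cordoba"],
    "country": ["Spain", "Argentina"],
})
cities = pd.DataFrame({
    "Country": ["es", "ar"],              # lowercase two-letter codes
    "AccentCity": ["Cordoba", "Cordoba"],
    "Latitude": [37.88, -31.42],          # approximate values
    "Longitude": [-4.78, -64.18],
})
countries = pd.DataFrame({"Name": ["Spain", "Argentina"], "Code": ["es", "ar"]})

# Step 1: attach the two-letter code to each venue via the country name.
venues_coded = venues.merge(countries, how="left", left_on="country", right_on="Name")

# Step 2: join on city AND country code so same-named cities stay separate.
city_coords = (
    venues_coded.merge(
        cities,
        how="left",
        left_on=["city", "Code"],
        right_on=["AccentCity", "Country"],
    )[["city", "country", "Latitude", "Longitude"]]
    .drop_duplicates(["city", "country"])
    .rename(columns={"Latitude": "latitude", "Longitude": "longitude"})
)

print(city_coords)
```

With this variant, the later merge back into the results would need `on=['city', 'country']` rather than `on='city'`.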

Finally, I have merged the geographical coordinates into the original dataset; the result is available in the combined dataframe named 'new_results'.

In [25]:
new_results = results.merge(city_coords, how='left', on='city')
new_results.head()
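
Because this is a left join, any city without a match in the coordinates table ends up with missing `latitude` and `longitude`. A minimal sketch of a coverage check, using made-up frames (`sample_results`, `sample_coords`, and the city 'Atlantis' are hypothetical stand-ins):

```python
import pandas as pd

# Toy stand-ins for `results` and `city_coords`; coordinates are approximate.
sample_results = pd.DataFrame({
    "city": ["Glasgow", "London", "Atlantis"],
    "home_score": [0, 4, 1],
})
sample_coords = pd.DataFrame({
    "city": ["Glasgow", "London"],
    "latitude": [55.86, 51.51],
    "longitude": [-4.25, -0.13],
})

# validate='many_to_one' raises if a city appears twice in the coordinates
# table, which would otherwise silently duplicate matches.
merged = sample_results.merge(
    sample_coords, how="left", on="city", validate="many_to_one"
)

missing = merged["latitude"].isna()
unmatched_cities = sorted(merged.loc[missing, "city"].unique())
print(f"{missing.sum()} of {len(merged)} matches lack coordinates: {unmatched_cities}")
```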

In [26]:
new_results.to_csv("international-football-results.csv.gz",compression="gzip",index=False)