# Joining Datasets
A join combines two datasets or dataframes based on one or more common columns. The operation is fundamental in data analysis, where it merges information from different data sources according to some relationship between them.

### Types of Joins
1. **Inner Join**: Includes only the rows that have matching values in both tables.
2. **Left Join**: Includes all rows from the left table (first dataframe) and the matching rows from the right table (second dataframe). Left rows without a match get NaN in the right table's columns.
3. **Right Join**: Includes all rows from the right table and the matching rows from the left table. Right rows without a match get NaN in the left table's columns.
4. **Outer Join (or Full Outer Join)**: Includes all rows from both tables, combining matching rows and filling in missing values with NaN or null values.
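
The four join types can be compared side by side on a pair of small tables. This is a minimal sketch: the `left` and `right` frames and their column names are hypothetical and are not part of the athletes dataset used later.

```python
import pandas as pd

# Hypothetical toy tables (not taken from the athletes dataset)
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Run the same merge once per join type and keep the results for comparison
results = {}
for how in ("inner", "left", "right", "outer"):
    results[how] = pd.merge(left, right, on="key", how=how)
    print(f"{how:>5}: keys = {sorted(results[how]['key'])}")
```

Only the `how` argument changes between the four calls, so the differences in the printed keys come entirely from the join type.
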
### Join Columns
Joins are commonly performed on one or more columns that have matching values in both tables. These columns are known as join keys.
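
When a single column does not identify a row, several columns can act as the join key together. The sketch below assumes two hypothetical tables, `results` and `ages`, that are both keyed by athlete and year.

```python
import pandas as pd

# Hypothetical tables keyed by (Athlete, Year)
results = pd.DataFrame({
    "Athlete": ["A", "A", "B"],
    "Year": [2004, 2008, 2008],
    "Total Medals": [2, 3, 1],
})
ages = pd.DataFrame({
    "Athlete": ["A", "A", "B"],
    "Year": [2004, 2008, 2008],
    "Age": [19, 23, 30],
})

# Joining on both columns pairs each participation with exactly one age
by_both = pd.merge(results, ages, on=["Athlete", "Year"])

# Joining on the name alone pairs every row of "A" with every other row of "A"
by_name = pd.merge(results, ages, on="Athlete")

print(len(by_both), len(by_name))
```

The second merge returns more rows than either input because the name alone is not a unique key.
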
### Libraries for Performing Joins
In Python, the most common library for working with dataframes and performing joins is pandas. For relational databases, joins are written in SQL and run through a database driver such as the standard-library sqlite3 module.
### Syntax in pandas
In pandas, you can perform joins with the `pd.merge()` function or with the `df.merge()` and `df.join()` methods. Each lets you specify the dataframes to combine, the join columns, and the type of join.
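
As a rough illustration, the same left join can be written in all three ways. The `medals` and `countries` frames here are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the medals and country tables
medals = pd.DataFrame({"Athlete": ["A", "B"], "Total Medals": [3, 1]})
countries = pd.DataFrame({"Athlete": ["A", "B"], "Country": ["X", "Y"]})

# Top-level function
via_function = pd.merge(medals, countries, on="Athlete", how="left")

# Method on the left dataframe
via_method = medals.merge(countries, on="Athlete", how="left")

# DataFrame.join matches against the index of the other frame,
# so the key column is moved into the index first
via_join = medals.join(countries.set_index("Athlete"), on="Athlete", how="left")

print(via_join)
```

`pd.merge()` and `df.merge()` take the same arguments, while `df.join()` is mainly a convenience for index-based joins.
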
### Efficiency and Performance
Join performance matters when working with large datasets. The size of the result depends on how many rows match each key, so duplicated keys can multiply the number of output rows. It helps to understand how joins work internally, and to keep only the needed columns and unique keys before merging.
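
To give an intuition for what a join does internally, here is a minimal sketch of a hash join written in plain Python. It illustrates one common strategy and makes no claim about how pandas implements merges. The function name `hash_inner_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_inner_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` using a hash table."""
    # Build phase: index the right table by its key value
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    # Probe phase: look up each left row and emit one output row per match
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample rows
left_rows = [
    {"Athlete": "A", "Medals": 3},
    {"Athlete": "B", "Medals": 1},
    {"Athlete": "C", "Medals": 2},
]
right_rows = [
    {"Athlete": "A", "Country": "X"},
    {"Athlete": "B", "Country": "Y"},
    {"Athlete": "B", "Country": "Z"},
]

joined = hash_inner_join(left_rows, right_rows, "Athlete")
for row in joined:
    print(row)
```

Each table is scanned once, and athlete "B" appears twice in the output because its key is duplicated on the right side.
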

<div class="alert alert-block alert-info">
<b>💡:</b> Joins are a powerful tool for combining and relating data from different sources in data analysis and data processing. Understanding the different types of "joins" and how to perform them correctly is essential to working with datasets or dataframes effectively.
</div>

---

In [16]:
import pandas as pd

**Note about the dataset:** The data appears to have been extracted, stored, and processed incorrectly: it omits information needed to identify each athlete unambiguously and the competitions in which they won medals. For this reason, some of the merged results can be difficult to interpret.

In [17]:
# Determine the folder containing the datasets
filepath = "../datasets/athletes/"

# Create dataframes
data_medals = pd.read_csv(filepath + "Medals.csv", encoding='ISO-8859-1')
data_country = pd.read_csv(filepath + "Athelete_Country_Map.csv", encoding="ISO-8859-1")
data_sports = pd.read_csv(filepath + "Athelete_Sports_Map.csv", encoding="ISO-8859-1")

In [18]:
data_medals.head()

Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals
0,Michael Phelps,23.0,2008,08/24/2008,8,0,0,8
1,Michael Phelps,19.0,2004,08/29/2004,6,0,2,8
2,Michael Phelps,27.0,2012,08/12/2012,4,2,0,6
3,Natalie Coughlin,25.0,2008,08/24/2008,1,2,3,6
4,Aleksey Nemov,24.0,2000,10/01/2000,2,1,3,6


In [19]:
data_country.head()

Unnamed: 0,Athlete,Country
0,Michael Phelps,United States
1,Natalie Coughlin,United States
2,Aleksey Nemov,Russia
3,Alicia Coutts,Australia
4,Missy Franklin,United States


In [20]:
data_sports.head()

Unnamed: 0,Athlete,Sport
0,Michael Phelps,Swimming
1,Natalie Coughlin,Swimming
2,Aleksey Nemov,Gymnastics
3,Alicia Coutts,Swimming
4,Missy Franklin,Swimming


Repeated elements are a common problem, and it is important to know how to filter them while losing as little information as possible. In this dataset, some athletes have won Olympic medals at more than one Games, and some names appear under more than one nation.

In [21]:
# Extract name of unique athletes
uniq_ath = data_medals['Athlete'].unique().tolist()
print(f"Unique Athletes in Medals DF: {len(uniq_ath)}")
print(f"Data Country Rows Number: {len(data_country)}")
print(f"Data Sports Rows Number: {len(data_sports)}")

Unique Athletes in Medals DF: 6956
Data Country Rows Number: 6970
Data Sports Rows Number: 6975
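
The mapping tables have more rows than there are unique athletes, which points to repeated names. One way to list them is sketched below on a hypothetical miniature of the country map, since the real rows are not reproduced here.

```python
import pandas as pd

# Hypothetical miniature of the athlete-to-country map
country_map = pd.DataFrame({
    "Athlete": ["A", "B", "B", "C"],
    "Country": ["X", "Y", "Z", "X"],
})

# keep=False marks every occurrence of a repeated name, not only the later ones
repeated = country_map[country_map["Athlete"].duplicated(keep=False)]
print(repeated)

# Number of distinct countries recorded for each athlete name
countries_per_athlete = country_map.groupby("Athlete")["Country"].nunique()
print(countries_per_athlete[countries_per_athlete > 1])
```
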


To see how an athlete's name can appear more than once in the mapping tables, we can filter for a few specific cases. Here the filter is applied to the sports map, where the same name is listed under more than one sport.

In [22]:
# Select the rows whose athlete name is one of the listed cases
data_sports[
    data_sports["Athlete"].isin(["Chen Jing", "Richard Thompson", "Matt Ryan"])
]

Unnamed: 0,Athlete,Sport
528,Richard Thompson,Athletics
1308,Chen Jing,Volleyball
1419,Chen Jing,Table Tennis
2727,Matt Ryan,Rowing
5003,Matt Ryan,Equestrian
5691,Richard Thompson,Baseball


## Merge Function
The `pd.merge()` function in pandas combines two DataFrames based on one or more shared columns. Essentially, it performs join operations similar to those found in relational databases.

- **Data Combination:** You can merge two DataFrames into one using common columns as join keys. This enables you to aggregate information from different sources into a single DataFrame for more comprehensive analysis.
- **Types of Joins:** You can specify what type of join you want to perform, such as inner join, left join, right join, or outer join. This gives you flexibility in deciding which rows to include in the final result based on the relationship between the DataFrame columns.
- **Specifying Join Columns:** You can specify the columns on which you want to perform the join. This is useful when DataFrames have multiple columns in common or when the columns you want to use for joining have different names in the DataFrames.
- **Complete Control over the Result:** You can control the final result using parameters like left_on, right_on, left_index, right_index, among others. This allows you to customize how the data is merged and which columns are included in the final result.
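
The parameters mentioned in the list above can be sketched as follows. The frames and the differently named key columns (`Athlete` and `Name`) are hypothetical. `indicator=True` is an additional `pd.merge()` option that records where each row came from.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
medals = pd.DataFrame({"Athlete": ["A", "B", "C"], "Total Medals": [3, 1, 2]})
countries = pd.DataFrame({"Name": ["A", "B", "D"], "Country": ["X", "Y", "Z"]})

merged = pd.merge(
    medals,
    countries,
    left_on="Athlete",   # key column in the left dataframe
    right_on="Name",     # key column in the right dataframe
    how="outer",         # keep unmatched rows from both sides
    indicator=True,      # add a _merge column: both / left_only / right_only
)
print(merged[["Athlete", "Name", "Country", "_merge"]])
```

The `_merge` column makes it easy to spot keys that exist in only one of the two tables.
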

[pandas DataFrame.merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)

In [23]:
# Drop duplicate athletes from the Country DataFrame to keep just one country
data_country_dp = data_country.drop_duplicates('Athlete')

# Check the length of non-duplicated elements is equal to that of athletes
print(f"Unique Athletes in Medals DF: {len(uniq_ath)} \nLength of Non-Duplicated Athletes in Country DF: {len(data_country_dp)}")

# Merge Medals DF with non-duplicated Country DF using the default (inner) merge
data_medals_country = pd.merge(data_medals, data_country_dp, on='Athlete')

# Verify length of merged dataframe is equal to medals dataframe
print(f"\nMedals DF length: {data_medals.shape[0]} and Merged Medals-Country DF length: {data_medals_country.shape[0]}")
data_medals_country.head()

Unique Athletes in Medals DF: 6956 
Length of Non-Duplicated Athletes in Country DF: 6956

Medals DF length: 8618 and Merged Medals-Country DF length: 8618


Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals,Country
0,Michael Phelps,23.0,2008,08/24/2008,8,0,0,8,United States
1,Michael Phelps,19.0,2004,08/29/2004,6,0,2,8,United States
2,Michael Phelps,27.0,2012,08/12/2012,4,2,0,6,United States
3,Natalie Coughlin,25.0,2008,08/24/2008,1,2,3,6,United States
4,Natalie Coughlin,21.0,2004,08/29/2004,2,2,1,5,United States


Because the Country DataFrame repeats an athlete once for each country they have represented, we need to drop those duplicates first. Otherwise, each of the athlete's medal rows is repeated once per country in the merge. The next cell merges without dropping duplicates to show the effect.

In [24]:
data_medals_country_b = pd.merge(data_medals, data_country, on='Athlete')
data_medals_country_b.head()

Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals,Country
0,Michael Phelps,23.0,2008,08/24/2008,8,0,0,8,United States
1,Michael Phelps,19.0,2004,08/29/2004,6,0,2,8,United States
2,Michael Phelps,27.0,2012,08/12/2012,4,2,0,6,United States
3,Natalie Coughlin,25.0,2008,08/24/2008,1,2,3,6,United States
4,Natalie Coughlin,21.0,2004,08/29/2004,2,2,1,5,United States
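
The first rows of this merge look the same as those of the deduplicated one, so the duplication is easy to overlook. One way to catch it is sketched below on hypothetical toy data: compare row counts, or let `pd.merge()` check key uniqueness through its `validate` parameter.

```python
import pandas as pd

# Hypothetical toy data: athlete "B" is mapped to two countries
medals = pd.DataFrame({"Athlete": ["A", "A", "B"], "Year": [2004, 2008, 2008]})
country_map = pd.DataFrame({"Athlete": ["A", "B", "B"], "Country": ["X", "Y", "Z"]})

# Without a check, the merge silently grows: B's single medal row appears twice
grown = pd.merge(medals, country_map, on="Athlete")
print(len(medals), len(grown))

# validate="many_to_one" requires the keys of the right dataframe to be unique
try:
    pd.merge(medals, country_map, on="Athlete", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError as err:
    duplicates_detected = True
    print(f"Merge rejected: {err}")

# After dropping duplicates, the same check passes and the row count is kept
deduplicated = country_map.drop_duplicates("Athlete")
safe = pd.merge(medals, deduplicated, on="Athlete", validate="many_to_one")
print(len(safe))
```
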


Comparing the two results, we can see that some athletes are repeated only because of the country name, while the medal data is identical. We therefore drop these duplicates, although doing so is not always advisable because it discards the other countries.

In [25]:
data_medals_country_b[data_medals_country_b['Athlete'] == 'Aleksandar Ciric']

Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals,Country
1503,Aleksandar Ciric,30.0,2008,08/24/2008,0,0,1,1,Serbia
1504,Aleksandar Ciric,30.0,2008,08/24/2008,0,0,1,1,Serbia and Montenegro
1505,Aleksandar Ciric,26.0,2004,08/29/2004,0,1,0,1,Serbia
1506,Aleksandar Ciric,26.0,2004,08/29/2004,0,1,0,1,Serbia and Montenegro
1507,Aleksandar Ciric,22.0,2000,10/01/2000,0,0,1,1,Serbia
1508,Aleksandar Ciric,22.0,2000,10/01/2000,0,0,1,1,Serbia and Montenegro


In [26]:
data_medals_country[data_medals_country['Athlete'] == 'Aleksandar Ciric']

Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals,Country
1491,Aleksandar Ciric,30.0,2008,08/24/2008,0,0,1,1,Serbia
1492,Aleksandar Ciric,26.0,2004,08/29/2004,0,1,0,1,Serbia
1493,Aleksandar Ciric,22.0,2000,10/01/2000,0,0,1,1,Serbia
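
Dropping duplicates keeps only the first country listed for each athlete. One possible alternative, sketched here on hypothetical toy data, keeps every country by collapsing them into a single string per athlete before merging. This is not the approach used in this notebook, and the `" / "` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data: athlete "B" is mapped to two countries
medals = pd.DataFrame({"Athlete": ["A", "A", "B"], "Year": [2004, 2008, 2008]})
country_map = pd.DataFrame({"Athlete": ["A", "B", "B"], "Country": ["X", "Y", "Z"]})

# Collapse the countries of each athlete into one string, one row per athlete
countries_joined = (
    country_map.groupby("Athlete")["Country"]
    .agg(" / ".join)
    .reset_index()
)

# The right-hand keys are now unique, so no medal row is duplicated
merged = pd.merge(
    medals, countries_joined, on="Athlete", how="left", validate="many_to_one"
)
print(merged)
```

The merged frame has the same number of rows as `medals`, and no country information is lost.
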


In [32]:
data_sports_dp = data_sports.drop_duplicates('Athlete')
data_final = pd.merge(data_medals_country, data_sports_dp, on='Athlete')

print(f"Non-duplicated Data Sports is equal to Unique Athletes with medals: {data_sports_dp.shape[0] == len(uniq_ath)}")
print(f"Participations of athletes with medals: {data_medals.shape[0]} \nFinal DF length: {data_final.shape[0]}")
data_final.head()

Non-duplicated Data Sports is equal to Unique Athletes with medals: True
Participations of athletes with medals: 8618 
Final DF length: 8618


Unnamed: 0,Athlete,Age,Year,Closing Ceremony Date,Gold Medals,Silver Medals,Bronze Medals,Total Medals,Country,Sport
0,Michael Phelps,23.0,2008,08/24/2008,8,0,0,8,United States,Swimming
1,Michael Phelps,19.0,2004,08/29/2004,6,0,2,8,United States,Swimming
2,Michael Phelps,27.0,2012,08/12/2012,4,2,0,6,United States,Swimming
3,Natalie Coughlin,25.0,2008,08/24/2008,1,2,3,6,United States,Swimming
4,Natalie Coughlin,21.0,2004,08/29/2004,2,2,1,5,United States,Swimming
