# info_strada_sports_scrape

This Python script scrapes athlete profile information from certain `infostradasports.com` pages. It works on the page currently set in `INFOSTRADASPORTS_URL` and will likely work on other pages that use the same template.

## Global Variables

- `INFOSTRADASPORTS_URL`: the URL that data is pulled from.
- `NUM_OF_ATHLETE_PROFILES_TO_CAPTURE`: the number of profiles to capture. Set this to `0` to capture all profiles on the page.
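
The `0`-means-all convention for `NUM_OF_ATHLETE_PROFILES_TO_CAPTURE` can be sketched as below. This is only an illustration of the rule described above: the helper name `limit_profile_links` and the placeholder link values are assumptions and do not appear in the script.

```python
# Hypothetical helper (not part of the original script): applies the
# "0 means capture everything" rule to a list of profile links.
def limit_profile_links(profile_links, max_profiles):
    """Return the first `max_profiles` links, or all links when the limit is 0."""
    if max_profiles == 0:
        return list(profile_links)
    return list(profile_links[:max_profiles])


# Placeholder values standing in for real profile URLs.
example_links = ["profile-1", "profile-2", "profile-3"]

first_two = limit_profile_links(example_links, 2)    # only the first two links
everything = limit_profile_links(example_links, 0)   # 0 keeps the whole list
print(first_two)
print(everything)
```

Treating `0` as a sentinel keeps the setting a plain integer, at the cost that "capture nothing" cannot be expressed.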

## Imports

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import time
from pprint import pprint

## Global Variables

In [2]:
NUM_OF_ATHLETE_PROFILES_TO_CAPTURE = 21
INFOSTRADASPORTS_URL = 'http://ipc.infostradasports.com/asp/lib/TheASP.asp?pageid=8903&sportid=514&NOCClubID=-1&Olympic=0&WinterGames=-1&ContinentGeoID=-1'

## Functions

In [3]:
def get_unique_keys(athlete_profiles):
    athlete_profile_keys = set()
    for athlete_profile in athlete_profiles:
        for key in athlete_profile:
            athlete_profile_keys.add(key)
    return athlete_profile_keys

def initialize_database(unique_keys):
    athlete_database = {}
    for key in unique_keys:
        athlete_database[key] = []
    return athlete_database
        
def fill_database(database, athlete_profiles):
    for athlete_profile in athlete_profiles:
        for key_d, value_d in database.items():
            if key_d in athlete_profile:
                database[key_d].append(athlete_profile[key_d])
            else:
                database[key_d].append('--')
    return database

            
def get_athlete_database(athlete_profiles):
    unique_keys = get_unique_keys(athlete_profiles)
    database = initialize_database(unique_keys)
    database = fill_database(database, athlete_profiles)
    return database

def fixup_whitespace(text):
    return " ".join(text.split())


# --- scrape functions --- #
    
def scrape_athlete_profile(bio_table):
    bio_table_trs = bio_table.find_all('tr')
    details = {}
    for index, tr in enumerate(bio_table_trs):
        if index == 0:
            details['Name'] = fixup_whitespace(tr.get_text())
        else:
            tds = tr.find_all('td')
            details[tds[0].get_text()] = fixup_whitespace(tds[1].get_text())
    return details

def scrape_human_interest_info(bio_table):
    bio_table_trs = bio_table.find_all('tr')
    details = {}
    for index, tr in enumerate(bio_table_trs):
        if len(tr.find_all('td')) == 1:
            continue
        else:
            tds = tr.find_all('td')
            details[tds[0].get_text()] = fixup_whitespace(tds[1].get_text())
    return details


def scrape_competition_highlights_info(bio_table):
    pass
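
The database helpers above turn a list of per-athlete dictionaries with differing keys into equal-length columns, filling gaps with `'--'`. The sketch below shows that behaviour on made-up data. `build_athlete_database` is a hypothetical, compact stand-in for `get_athlete_database`, and the sample profiles are invented for illustration. Unlike the original, which collects keys in a `set`, this version uses a `dict` so that column order is stable.

```python
# Hypothetical compact equivalent of get_athlete_database (an assumption,
# not the notebook's code): one column per key seen in any profile.
def build_athlete_database(athlete_profiles, missing="--"):
    columns = {}
    # First pass: register every key, keeping first-seen order.
    for profile in athlete_profiles:
        for key in profile:
            columns.setdefault(key, [])
    # Second pass: append one value per profile to every column.
    for profile in athlete_profiles:
        for key, values in columns.items():
            values.append(profile.get(key, missing))
    return columns


# Invented sample data; real profiles come from the scrape functions.
sample_profiles = [
    {"Name": "ATHLETE One", "NPC": "Exampleland", "Age": "30"},
    {"Name": "ATHLETE Two", "Height": "1.80 m"},
]

database = build_athlete_database(sample_profiles)
for column, values in database.items():
    print(column, values)
```

Because every column ends up with one entry per athlete, the resulting dictionary can be passed directly to `pd.DataFrame`.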

## Initialize web driver, visit `INFOSTRADASPORTS_URL` and capture links

In [4]:
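from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the driver path to be wrapped in a Service object;
# passing it positionally is deprecated.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
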
[WDM] - Current google-chrome version is 103.0.5060
[WDM] - Get LATEST chromedriver version for 103.0.5060 google-chrome
[WDM] - Driver [/home/ff/.wdm/drivers/chromedriver/linux64/103.0.5060.134/chromedriver] found in cache


In [5]:
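from selenium.webdriver.common.by import By

driver.get(INFOSTRADASPORTS_URL)
# find_elements_by_xpath is deprecated in Selenium 4; use find_elements with By.XPATH.
links = driver.find_elements(By.XPATH, "//a[@href]")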



## Script

In [7]:
df_final

| | NPC | Height | Age | Place of Birth | Sport | Gender | Name | Impairment | Nicknames | Other information | ... | Higher education | When and where did you begin this sport? | Training Regime | Name of coach | Most influential person in career | Type of Impairment | Year | Tournament | Memorable sporting achievement | Residence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | -- | 43 | -- | Para athletics, Para swimming, Shooting Para s... | Men | HAIDARI Zubair | -- | -- | -- | ... | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1 | Afghanistan | -- | 37 | -- | Para swimming | Men | PARWANI Khan Agha | -- | -- | -- | ... | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 2 | Andorra | -- | 51 | -- | Para swimming | Men | CODINA MOLINE Marc | -- | -- | -- | ... | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 3 | Andorra | -- | 37 | -- | Para swimming | Men | SANCHEZ FRANCISCO Antonio | -- | -- | NATIONAL FIRSTHe became the first person to re... | ... | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 4 | Angola | -- | 29 | -- | Para swimming | Men | LOPES Silvio Mendes | -- | -- | -- | ... | Universidade Tecnica de Angola: Luanda, ANG | -- | He trains at the First of August Swimming Pool... | Priscila Fernandes | -- | Limb deficiency | -- | -- | -- | Luanda, ANG |
| 5 | Angola | -- | 34 | -- | Para swimming | Women | MORAIS Jandira Q. Paixao | -- | -- | STUDY FIRSTShe did not compete in competitions... | ... | -- | -- | -- | Priscila Fernandes | -- | -- | -- | -- | -- | Luanda, ANG |
| 6 | Argentina | -- | 21 | Río Gallegos, ARG | Para swimming | Women | ALONSO Milagros | She has a specific language impairment, with s... | Mili (elmediadortv.com.ar, 13 Mar 2017) | -- | ... | -- | She began swimming at age six. | -- | -- | -- | Intellectual impairment | -- | -- | -- | -- |
| 7 | Argentina | -- | 21 | Villa Carlos Paz, ARG | Para swimming | Women | ARAGÓN Jazmín | She sustained a brachial plexus injury at birt... | Jaz (Facebook profile, 12 Apr 2019) | -- | ... | Fashion Design - University of Buenos Aires [U... | She began swimming at age seven in Villa Carlo... | -- | Edith Arraspide [club] | Her mother. (cadena3.com, 10 Mar 2019) | Impaired range of motion | -- | -- | -- | Villa Carlos Paz, ARG |
| 8 | Argentina | 1.77 m | 31 | Lobería, ARG | Para swimming | Men | ARAYA Elián | His impairment affects his hearing and means h... | -- | -- | ... | -- | He began swimming in 2002 in Loberia, Argentina. | -- | Pablo Quinteros [club]; Marcela Belviso [natio... | -- | Intellectual impairment | 2013 | -- | -- | -- |
| 9 | Argentina | -- | 25 | -- | Para swimming | Men | ARCE Facundo Matias | -- | -- | -- | ... | -- | -- | -- | -- | -- | Cerebral Palsy | -- | -- | -- | Trelew, ARG |
