# This file web-scrapes player data and team data
## The code is explained below

### First, installing and importing the required modules
* `BeautifulSoup` for web scraping
* `pandas` for creating DataFrames to represent data in a structured format
* `requests` for sending HTTP requests to the website
* `json` to parse the scraped JSON text into a Python dictionary
* `time` to pause between requests, so that we do not send many requests at once and the load on the server is reduced
* `functools` for the `reduce()` function, which lets us apply the same operation repeatedly across a list

**You will also need to install `openpyxl` to write DataFrames to an Excel file.**

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
import time
from functools import reduce

### The URL structure of [Pro Kabaddi Stats](https://www.prokabaddi.com/stats)

The URL structure of the website is very simple. It is as follows:

    https://www.prokabaddi.com/stats/{season_id}-{stats_id}-a-statistics

Every season has an **id** associated with it.
> For example, **Season 9 has id 25**.

Every statistic type also has an **id** associated with it.
> For example, **Total Points Scored by a Player has id 102** and **Total Points Scored by a Team has id 96**.

So if we want the data for **Total Points Scored by a Team in Season 9**, the corresponding URL is <https://www.prokabaddi.com/stats/25-96-a-statistics>.

*We worked out this URL structure manually, after spending some hours studying the structure of the Pro Kabaddi website.*

The four dictionaries
```
season_to_id
id_to_season
player_stats_to_id
teams_stats_to_id
```
represent just the information given above.
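
As a minimal sketch of how these dictionaries combine with the URL template, the snippet below rebuilds the Season 9 team-points URL quoted above. It uses only a one-entry subset of each dictionary, and the helper name `build_stats_url` is hypothetical, introduced here purely for illustration.

```python
# One-entry subsets of the lookup tables described above (ids taken from the text).
season_to_id = {'Season 9': 25}
teams_stats_to_id = {'total_points': 96}

URL_TEMPLATE = 'https://www.prokabaddi.com/stats/{0}-{1}-a-statistics'


def build_stats_url(season_id, stat_id):
    """Hypothetical helper: fill the season id and statistic id into the template."""
    return URL_TEMPLATE.format(season_id, stat_id)


example_url = build_stats_url(season_to_id['Season 9'],
                              teams_stats_to_id['total_points'])
print(example_url)
```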



In [33]:


season_to_id={'Season 9': 25, 'Season 8': 20, 'Season 7': 11,
              'Season 6': 10, 'Season 5': 8,  'Season 4': 4,  
              'Season 3': 3,  'Season 2': 2,  'Season 1': 1,  
              'All Seasons': 0}
id_to_season={25: 'Season 9', 20: 'Season 8', 11: 'Season 7',
              10: 'Season 6', 8: 'Season 5',  4: 'Season 4',
              3: 'Season 3',  2: 'Season 2', 1: 'Season 1',
              0: 'All Seasons'}
player_stats_to_id={'total_points': 102, 'successful_raids': 21,
                    'raid_points': 22, 'successful_tackles': 23,
                    'tackle_points': 103, 
                    'do_or_die_raid_points': 132, 'super_raids': 104,
                    'super_tackles': 28, 'super_10s': 100, 'high_5s': 101
                    }
teams_stats_to_id={'total_points': 96, 'successful_raids': 13,
                   'raid_points': 97, 'successful_tackles': 15,
                   'tackle_points': 95,
                   'do_or_die_raid_points': 135, 'super_raids': 134,
                   'super_tackles': 20, 'total_points_conceded': 133,
                   'all_outs_inflicted': 136, 'all_out_conceded': 137
                   }
url='https://www.prokabaddi.com/stats/{0}-{1}-a-statistics'

In [None]:
all_seasons_df = []  # will contain one DataFrame per season
for season_name, season_id in season_to_id.items():
    if season_name == 'All Seasons':
        continue  # skip the aggregate entry; each season is scraped individually
    data_frames = []  # will contain one DataFrame per stat_type for the current season_name
    for stat_type, stat_id in player_stats_to_id.items():
        req = requests.get(url.format(season_id, stat_id))  # request the stats page
        soup = BeautifulSoup(req.content, "html.parser")  # parse the HTML
        new = soup.find_all('script')[4].string  # text of the fifth <script> tag, which holds the stats object
        new = new[new.find('{'):]  # drop the variable name and keep everything from the first '{'
        new2 = json.loads(new)  # load the JSON object as a Python dictionary
        file = new2["stats"]["data"]  # keep only the list of records
        for record in file:  # clean each record: rename 'value', tag the season, drop redundant columns
            record[stat_type] = record.pop('value')
            record['season'] = season_name
            for key in ('team', 'rank', 'team_name', 'position_id'):
                del record[key]
        data_frames.append(pd.DataFrame(file))  # DataFrame for the current stat_type and season_name
        time.sleep(3)  # wait 3 seconds between requests to reduce the load on the web server
    # outer-merge all stat_type DataFrames into a single DataFrame for the current season_name
    combined_data = reduce(lambda left, right: pd.merge(left, right, how='outer'), data_frames).fillna(0)
    all_seasons_df.append(combined_data)  # add to the list of season-wise DataFrames
final_data_frame = pd.concat(all_seasons_df, ignore_index=True)  # combine all seasons into a single DataFrame
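
The extraction step in the loop above relies on the stats being embedded in a `<script>` tag, with a variable name in front of the JSON object. The sketch below replays that slicing on a made-up string so it can be run offline. The variable name `window.statsData`, the field `player_name` and the sample values are all assumptions for illustration, not the site's real payload.

```python
import json

# Made-up stand-in for the text of the <script> tag (not real site data).
script_text = ('window.statsData = {"stats": {"data": '
               '[{"player_name": "Player A", "value": 10, "rank": 1}]}}')

# Keep everything from the first '{' onward, dropping the variable name.
json_text = script_text[script_text.find('{'):]

# Parse the JSON text into a Python dictionary and pull out the records.
records = json.loads(json_text)["stats"]["data"]

# Same clean-up idea as the loop: rename 'value' to the stat name, drop 'rank'.
for record in records:
    record['total_points'] = record.pop('value')
    del record['rank']

print(records)
```

Note that `json.loads` raises an error if anything, such as a trailing semicolon, follows the closing brace, so this slicing only works when the object runs to the end of the script text.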

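To see what the `reduce` and outer-merge step does, here is a small sketch with two made-up per-statistic DataFrames. The column name `player_name` and all values are assumptions for illustration. With no `on` argument, `pd.merge` joins on the columns the two frames share, so a player missing from one statistic still gets a row, and `fillna(0)` turns the resulting gap into 0.

```python
from functools import reduce

import pandas as pd

# Two made-up per-statistic tables for the same season (not real data).
raid = pd.DataFrame({'player_name': ['Player A', 'Player B'],
                     'season': ['Season 9', 'Season 9'],
                     'raid_points': [120, 80]})
tackle = pd.DataFrame({'player_name': ['Player B', 'Player C'],
                       'season': ['Season 9', 'Season 9'],
                       'tackle_points': [15, 40]})

# Outer-merge the frames pairwise on their shared columns
# ('player_name' and 'season'), then replace missing values with 0.
combined = reduce(lambda left, right: pd.merge(left, right, how='outer'),
                  [raid, tackle]).fillna(0)
print(combined)
```
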
### Storing our data
This cell writes the DataFrame to an Excel file (`openpyxl` is used here).

In [32]:
final_data_frame.to_excel('players_data.xlsx', index=False)