# Lorcana Data Acquisition #
The first step in this process is data acquisition. I will start in this notebook simply by acquiring a list of all cards, as well as images (when available), which I was able to find on [lorcanaplayer.com](https://lorcanaplayer.com/lorcana-card-list/).

## Building a List of All Cards ##
The site presents each set's cards as an HTML table, so I should be able to read the data directly with pandas. The tables also include pictures of the cards, which I would like to have as well. I will start by pulling what data I can with pandas and circle back to the images.

In [4]:
import os
from io import StringIO
import pandas as pd
import requests

In [5]:
card_list_url = r'https://lorcanaplayer.com/lorcana-card-list/#the-first-chapter'
resp = requests.get(card_list_url)

card_lists = pd.read_html(StringIO(resp.text))

# I am going to start with The First Chapter
df_first_chapter = card_lists[1]
df_first_chapter.head()

Unnamed: 0,Image,Card number,Card Name,Type,Ink color,Rarity
0,,1/204,Ariel - On Human Legs,Character,Amber,Uncommon
1,,2/204,Ariel - Spectacular Singer,Character,Amber,Super Rare
2,,3/204,Cinderella - Gentle and Kind,Character,Amber,Uncommon
3,,4/204,Goofy - Musketeer,Character,Amber,Uncommon
4,,5/204,Hades - King of Olympus,Character,Amber,Rare


That looks good to me. I will need to find a way to get the images: they did not come through, and neither did the links to the underlying files. I will find a way to pull those links and align them with their cards. I will also download all the images and save them locally.

First, I will simply save the tables as files, starting with the raw data exactly as scraped. I will also want to do some light processing (e.g. parsing the `Card number` column, which holds strings like `1/204`, into an integer) and save the processed files separately. Later, I will build a database to hold the data and a simple pipeline to transform it.
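
As a rough sketch of that light processing, the helper below splits a `number/total` string into two integers. The name `parse_card_number` and the `P1` sample value are made up for illustration. I have only seen values of the form `1/204` in the table above, so anything else is returned as `(None, None)` for manual inspection.

```python
import re

# Matches strings such as "1/204": the card's number, a slash, the set size.
CARD_NUMBER_PATTERN = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*$")


def parse_card_number(raw):
    """Split a 'number/total' string into a pair of integers.

    Returns (None, None) when the value does not fit the pattern,
    so unusual entries can be inspected instead of silently dropped.
    """
    match = CARD_NUMBER_PATTERN.match(str(raw))
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


# "P1" is a hypothetical malformed value, included only to show the fallback.
sample = ["1/204", "2/204", "5/204", "P1"]
parsed = [parse_card_number(value) for value in sample]
print(parsed)
```

With the DataFrame from above, something like `df_first_chapter['Card number'].map(parse_card_number)` should give one tuple per row, which can then be split into two new columns.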

In [6]:
raw_dir = r'./data/raw'
os.makedirs(raw_dir, exist_ok=True)  # create the folder if it does not exist yet
df_first_chapter.to_csv(os.path.join(raw_dir, 'the_first_chapter.csv'), index=False)

# Rise of the Floodborn
df_rotfb = card_lists[0]
df_rotfb.to_csv(os.path.join(raw_dir, 'rise_of_the_floodborn.csv'), index=False)

# Disney 100 Promos
df_d100_promos = card_lists[2]
df_d100_promos.to_csv(os.path.join(raw_dir, 'd100_promos.csv'), index=False)

The data look reasonable to me. Next, I will try to obtain the images from the website.

### Card Images ###

I am only somewhat familiar with the packages I'm going to use, so this will be a learning experience. I am going to use BeautifulSoup to extract the image addresses. I think that if I find all the image elements inside a card list table, they should come back in the same order as the table rows, so I should be able to line them up with the cards. But I'm not sure.

[This StackOverflow answer](https://stackoverflow.com/questions/2010481/how-do-you-get-all-the-rows-from-a-particular-table-using-beautifulsoup) isn't exactly what we need, but I think it's close enough that I can make it work.

In [7]:
from bs4 import BeautifulSoup

In [8]:
soup = BeautifulSoup(resp.text, 'html.parser')

# I need elements like '//table[@class="card-list-table"]/tbody/tr/td/a/img'
tables = soup.find_all('table')
first_chapter_html = tables[1]
#print(first_chapter_html)

images = first_chapter_html.find_all('img')
images[0]['src']

'https://e40b7872.flyingcdn.com/wp-content/uploads/2023/04/Ariel-On-Human-Legs-1-158x220.jpg'
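
Since I'm not sure the images really come back in card order, here is a small sanity check I could run once I have both lists. It is only a sketch. The helper names `slugify` and `find_misaligned` are my own, and it assumes that each image file name starts with the card name, which I have only confirmed for the one URL above. Anything it flags would need a manual look rather than being treated as a definite error.

```python
import os
import re


def slugify(text):
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def find_misaligned(card_names, image_urls):
    """Pair names with URLs by position and return the pairs that look wrong."""
    if len(card_names) != len(image_urls):
        raise ValueError(
            f"{len(card_names)} cards but {len(image_urls)} images"
        )
    suspicious = []
    for name, url in zip(card_names, image_urls):
        filename = os.path.basename(url).lower()
        # Assumption: the file name begins with the slugified card name.
        if not filename.startswith(slugify(name)):
            suspicious.append((name, url))
    return suspicious


names = ["Ariel - On Human Legs"]
urls = [
    "https://e40b7872.flyingcdn.com/wp-content/uploads/2023/04/"
    "Ariel-On-Human-Legs-1-158x220.jpg"
]
print(find_misaligned(names, urls))  # an empty list means nothing was flagged
```

With the real data, the inputs would be something like `df_first_chapter['Card Name'].tolist()` and `[img['src'] for img in images]`.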

In [9]:
# Download all the images
image_dir = r'./data/images'
# makedirs also creates ./data if it is missing and does nothing if the folder exists
os.makedirs(image_dir, exist_ok=True)

for image in images:
    # Stream the response so each image is written to disk in chunks
    img_resp = requests.get(image['src'], stream=True, timeout=30)

    if img_resp.status_code == 200:
        img_filename = os.path.basename(image['src'])
        img_out_path = os.path.join(image_dir, img_filename)
        with open(img_out_path, 'wb') as f:
            for chunk in img_resp.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print(f"Error downloading {image['src']}: HTTP {img_resp.status_code}")
        # TODO: more robust error handling

I can see image files populating, so this seems to be working. There are a couple of **To Do** items I will eventually want to take on:
- Wrap the cell above as a function
- Add some useful metadata to the output file names (and rename the files I already downloaded)
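
A first sketch of how both To Do items might look is below. The names `build_image_filename` and `download_images`, the `tfc` set prefix, and the `example.com` URL are all placeholders I made up. The `session` argument is assumed to behave like a `requests.Session`, which also makes the function easy to try out with a stand-in object instead of hitting the website.

```python
import os


def build_image_filename(set_slug, card_number, src_url):
    """Prefix the original file name with the set and a zero-padded number."""
    original = os.path.basename(src_url)
    return f"{set_slug}_{card_number:03d}_{original}"


def download_images(session, items, out_dir, set_slug):
    """Download (card_number, url) pairs and return the URLs that failed.

    `session` is assumed to behave like requests.Session: get() returns a
    response with status_code and iter_content().
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for card_number, url in items:
        resp = session.get(url, stream=True, timeout=30)
        if resp.status_code != 200:
            failed.append(url)
            continue
        filename = build_image_filename(set_slug, card_number, url)
        with open(os.path.join(out_dir, filename), "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    return failed


# Placeholder URL, only to show the naming scheme.
print(build_image_filename("tfc", 1, "https://example.com/img/Ariel-1-158x220.jpg"))
```

In the notebook, a call might look like `download_images(requests.Session(), enumerate((img['src'] for img in images), start=1), image_dir, 'tfc')`. That assumes an image's position in the table equals its card number, which I still need to verify. Returning the failed URLs instead of just printing an error is one option for the more robust error handling noted in the cell above.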

I will keep on moving for now. I will work mostly with the First Chapter set for this first pass.