This is a test run of some code that pulls the rating data for the user "ConnorEatsPants", before trying to scale it up
to the full data set. The webpage in question is:

https://letterboxd.com/connoreatspants/

Important notes: selenium, a webdriver, and lxml need to be installed for any of this to work.

-Selenium: Just use [pip install selenium] in terminal

-lxml: Just use [pip install lxml] in terminal

-webdriver: Follow these docs

https://github.com/SergeyPirogov/webdriver_manager

Technically this is a separate package (webdriver-manager) rather than the driver itself, but it downloads and manages the driver for you, which supposedly makes life easier.

My version when running this code is:

selenium.__version__ = 4.36.0

The selenium version does change the syntax for webdriver-manager a bit, but this is covered on the page linked above.
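A quick way to confirm what is installed before running anything is sketched below. The helper name `installed_version` is made up for this example, and the package names are assumed to be the ones used on PyPI (`selenium`, `webdriver-manager`, `lxml`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report each package this notebook imports (names assumed to match PyPI)
for name in ("selenium", "webdriver-manager", "lxml"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```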

The general idea is to use selenium to run a virtual browser on their webpage to get access to the information.
After that, lxml runs through the html tree to find the pieces of the webpage we're interested in. We then have to convert the info,
since it's of the form "n .5star reviews" and we only need n. After we extract n, we just write it to a .csv file with one column
for each value on the rating scale.
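The extraction step can be sketched on its own, without a browser. The helper name `first_count` is hypothetical, and two things here are assumptions rather than anything checked against the site: that a rating with no entries gives an empty list, and that large counts may contain a thousands separator such as "1,234".

```python
def first_count(titles):
    """Return the leading integer of the first tooltip string, or 0 if there is none.

    `titles` stands in for the list returned by tree.xpath(...).
    """
    if not titles:
        # Assumption: no matching bar means no ratings at that value
        return 0
    leading = titles[0].split(" ")[0]
    # Assumption: counts may carry thousands separators, which would also break the .csv
    return int(leading.replace(",", ""))


print(first_count(["15 1star reviews"]))
print(first_count(["1,234 4star reviews"]))
print(first_count([]))
```

Returning an int rather than a string means the value has to go through str() before being written to the file.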

Credit: This code is just a modification of code by Data Science FilmMaker on medium.com

In [36]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from lxml import html

chrome_options = Options()
chrome_options.add_argument("--headless") 

## This flag gives headless browsing (= no browser window actually pops up), which speeds things up.
## Without it my run times were often about 40 s per function call, whereas with it
## they seem to drop to ~10 s per function call

## The browser itself is started inside reviewer_ratings() below, so nothing is launched here

In [37]:

reviewers = ['connoreatspants']

## This is a list so that we can build the full set of reviewers separately, and then import it here

## TODO: Write a function which builds the list of reviewers we want to use

def reviewer_ratings():

    # Open the virtual browser
    browser = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options = chrome_options)

    filename = 'connoreatspants_test.csv'
    # Open the file once: 'w' starts it fresh, then the header and all rows are written through the same handle
    with open(filename, 'w') as file_object:
        file_object.write("User,.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\n")

        for g in range(len(reviewers)):

            print(f"Opening {reviewers[g]}'s page...")
            pagename = 'https://letterboxd.com/' + reviewers[g] + '/'
            browser.get(pagename)
            print("Getting innerHTML...")
            innerHTML = browser.execute_script("return document.body.innerHTML")
            print("Creating tree...")
            tree = html.fromstring(innerHTML)
        
            # Grab the relevant data and put into lists
            print("parsing...")

            ## This one is funky. Looking through the html, the best way to uniquely identify the different
            ## rating bars is just to specify the location of the bars on the page. The general formula seems to be
            ## location(n stars) = "width: 17px; left: (2n-1)*18 px" or so, e.g. 0px for .5 stars and 18px for 1 star

            reviews_05 = tree.xpath('//li[@style="width: 17px; left: 0px"]/a/@data-original-title')
            reviews_1 = tree.xpath('//li[@style="width: 17px; left: 18px"]/a/@data-original-title')     ## Note: Could copy the XPath from inspect element instead,
            reviews_15 = tree.xpath('//li[@style="width: 17px; left: 36px"]/a/@data-original-title')    ## but matching on style seems to give faster run times???
            reviews_2 = tree.xpath('//li[@style="width: 17px; left: 54px"]/a/@data-original-title')  
            reviews_25 = tree.xpath('//li[@style="width: 17px; left: 72px"]/a/@data-original-title') 
            reviews_3 = tree.xpath('//li[@style="width: 17px; left: 90px"]/a/@data-original-title')  
            reviews_35 = tree.xpath('//li[@style="width: 17px; left: 108px"]/a/@data-original-title')   
            reviews_4 = tree.xpath('//li[@style="width: 17px; left: 126px"]/a/@data-original-title')
            reviews_45 = tree.xpath('//li[@style="width: 17px; left: 144px"]/a/@data-original-title')  
            reviews_5 = tree.xpath('//li[@style="width: 17px; left: 162px"]/a/@data-original-title')
            
            ## At this point, each output is along the lines of ['<n> .5star reviews'], so we want to extract
            ## n from the string.
            ## The general setup is to split the string on spaces and pick out the first element, which recovers the whole number.
            ## E.g., '15 1star reviews'.split(" ") -> ["15", "1star", "reviews"] -> '15'


            print("Extracting Review Number...")
            sr05 = reviews_05[0].split(" ")[0]          
            sr1 = reviews_1[0].split(" ")[0]
            sr15 = reviews_15[0].split(" ")[0]
            sr2 = reviews_2[0].split(" ")[0]
            sr25 = reviews_25[0].split(" ")[0]
            sr3 = reviews_3[0].split(" ")[0]
            sr35 = reviews_35[0].split(" ")[0]
            sr4 = reviews_4[0].split(" ")[0]
            sr45 = reviews_45[0].split(" ")[0]
            sr5 = reviews_5[0].split(" ")[0]

            print("Writing to File...")
            file_object.write(reviewers[g]+",")
            file_object.write(sr05+","+sr1+","+sr15+","+sr2+","+sr25+","+sr3+","+sr35+","+sr4+","+sr45+","+sr5+"\n")


        print(f"File {filename} written")

    # Close the virtual browser so Chrome processes don't pile up across runs
    browser.quit()

In [38]:
reviewer_ratings() ## Run time seems very dependent on factors I don't understand

Opening connoreatspants's page...
Getting innerHTML...
Creating tree...
parsing...
Extracting Review Number...
Writing to File...
File connoreatspants_test.csv written
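The ten near-identical XPath lines in reviewer_ratings() all follow the "left" offset formula from the comments, so they could be generated in a loop. This is only a sketch: the function name `rating_xpaths` is made up, and it assumes the 17px width and 18px spacing seen on this one page hold for every profile.

```python
def rating_xpaths():
    """Map each star rating (0.5 to 5.0) to the XPath of its histogram bar tooltip."""
    xpaths = {}
    for step in range(10):      # step 0 -> .5 stars, step 9 -> 5 stars
        stars = (step + 1) / 2
        left = step * 18        # same as (2 * stars - 1) * 18
        xpaths[stars] = (
            f'//li[@style="width: 17px; left: {left}px"]/a/@data-original-title'
        )
    return xpaths


for stars, xpath in rating_xpaths().items():
    print(stars, xpath)
```

Each value can be passed to tree.xpath(...) exactly as the hand-written strings are.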


Now I just want to check it manually, so I'll read the file directly. We should get a .csv file of the form

user     .5     1     1.5    ...

connoreatspants   3   6   8   ...

In [40]:
with open("connoreatspants_test.csv","r") as file_object:
    print(file_object.read())


User,.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
connoreatspants,3,6,8,11,8,19,29,34,19,23



Running the above code does verify this. What's left after this is to compile a list of the reviewers we want to use for the data, and then
run the above code to get their rating data. Then we can start to do statistics on them.
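As a first example of the kind of statistic this file supports, here is a sketch that computes each user's mean star rating. The function name `mean_ratings` is hypothetical, and the file contents printed above are inlined so the example runs without the .csv on disk. Note that skipinitialspace=True is needed because the header row has a space after each comma.

```python
import csv
import io

# The file contents shown above, inlined so the sketch runs on its own
SAMPLE = """User,.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
connoreatspants,3,6,8,11,8,19,29,34,19,23
"""


def mean_ratings(csv_file):
    """Return {user: mean star rating} for a file in the format written above."""
    reader = csv.reader(csv_file, skipinitialspace=True)
    header = next(reader)
    stars = [float(label) for label in header[1:]]   # ".5" -> 0.5, "1" -> 1.0, ...
    means = {}
    for row in reader:
        if not row:
            continue                                  # skip blank lines
        counts = [int(count) for count in row[1:]]
        total = sum(counts)
        weighted = sum(s * c for s, c in zip(stars, counts))
        means[row[0]] = weighted / total if total else None
    return means


print(mean_ratings(io.StringIO(SAMPLE)))
```

With the real file, the same function can be called on open("connoreatspants_test.csv", newline="") in place of the StringIO object.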