# Code to extract data from ravelry.com using API

API documentation: https://www.ravelry.com/api

Used read-only credentials for data retrieval.

Get data for patterns sorted by most projects, yarns sorted by most projects, and shops in the US.

In [None]:
import base64
import requests
import json
import pandas as pd

In [None]:
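# credentials stored in text files
# strip() removes the trailing newline so it is not sent as part of the credential

with open('../data/rav_user.txt') as f:
    username = f.read().strip()
with open('../data/rav_pass.txt') as f:
    password = f.read().strip()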

In [None]:
# pattern listing results sorted by number of projects descending.
# manually tested different page numbers to see where the number of projects declines - very long tail!

def get_pattern_info(query, username, password):
    # credentials are deliberately not printed so they do not end up in saved notebook output
    print(query)
    res = requests.get('https://api.ravelry.com/patterns/search.json',
                       params={'query': query, 'sort': 'projects', 'page_size': 1000, 'page': 80},
                       auth=requests.auth.HTTPBasicAuth(username, password))
    return res

In [None]:
res = get_pattern_info('', username, password)
result = json.loads(res.content)
result['patterns']
patterns_df = pd.DataFrame(result['patterns'])
patterns_df.tail()

# Thoughts after looking at chunks of pattern listings

Initially I planned to get the top 10,000 patterns, but that cutoff doesn't make sense given the long tail (see notes). Cutting off at patterns with at least 100 projects makes more sense: 100 projects is enough to show that the pattern appeals to a good number of people (at least enough to create a project page for it). My purpose for this part of the analysis is to determine what types of patterns people like and what characteristics they share: yarn weight, category, and possibly the amount of yarn required.

notes:

10,000th pattern has 241 projects, while 1st pattern has 23.6k.

12,000th pattern has 206 projects. 15,000th pattern has 169. I'm actually a little surprised by the length of the tail.

20,000th pattern has 130 projects. 30,000th has 88. 40,000th has 67. 60,000th has 43. 80,000th has 31.
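
The sampled ranks in these notes are enough to bracket where a 100-project cutoff falls. A minimal sketch using only the figures above (the function name `bracket_cutoff` is hypothetical, and 23.6k is written out as 23600):

```python
# (rank, project count) pairs copied from the notes above
samples = [
    (1, 23600),  # 23.6k in the notes
    (10000, 241), (12000, 206), (15000, 169),
    (20000, 130), (30000, 88), (40000, 67), (60000, 43), (80000, 31),
]

def bracket_cutoff(samples, min_projects):
    """Return (last sampled rank at or above min_projects,
    first sampled rank below it). Either side may be None."""
    last_above = None
    for rank, projects in sorted(samples):
        if projects < min_projects:
            return last_above, rank
        last_above = rank
    return last_above, None

print(bracket_cutoff(samples, 100))
```

With a page size of 1000, the cutoff therefore falls somewhere in pages 21 to 30 of the search results.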

# Plan:

Define a function and use a for loop to get pattern ids. Page size = 1000, as that doesn't take long to run. Put the id column into a list.

Using list of pattern ids, use a function and for loop to retrieve pattern details using the appropriate API method. Ravelry does allow for multiple pattern ids to be put into the method to retrieve details so I won't have to call the API 30,000 times.
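
One possible shape for the first step, sketched so it can run offline: the page fetcher is passed in as a callable, and ids go straight into one flat list of strings. `collect_ids` and `fake_fetch` are hypothetical names, and the ids in the stand-in are made up.

```python
def collect_ids(fetch_page, n_pages):
    """Collect ids from pages 1..n_pages into one flat list of strings.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON for that page (a dict with a 'patterns' list).
    """
    ids = []
    for page in range(1, n_pages + 1):
        result = fetch_page(page)
        # extend (not append) keeps the list flat across pages
        ids.extend(str(p['id']) for p in result['patterns'])
    return ids


# Stand-in for the real request so the sketch runs without network access.
def fake_fetch(page):
    return {'patterns': [{'id': page * 10 + k} for k in range(3)]}

print(collect_ids(fake_fetch, 2))
```

A real fetcher would wrap the `requests.get` call on the search endpoint and return the decoded JSON. This is an alternative to appending one list per page and flattening afterwards.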

In [None]:
def get_pattern_info(username, password):
    pattid_list = []
    for page in range(1, 31):
        print(page)
        res = requests.get('https://api.ravelry.com/patterns/search.json?sort=projects&page_size=1000&page=' + str(page),
                       auth=requests.auth.HTTPBasicAuth(username, password))
        result = json.loads(res.content)
        patterns_df = pd.DataFrame(result['patterns'])
        print(patterns_df.shape)
        pattid_list.append(patterns_df.id.tolist())
    return pattid_list


In [None]:
# returns pattern ids as integers, list of lists (a list for each page of results)

pattid_list = get_pattern_info(username, password)

In [None]:
# to flatten list, and change integers into strings

pattid_flat_list = [str(patt_id) for sublist in pattid_list for patt_id in sublist]

Experimented with requesting different numbers of pattern ids at once to see what the API would allow. 30,000 and even 1,000 were too many, as expected. (Getting 1,000 per page for pattern listings was fine since each listing carries much less information.) Settled on 200 as a reasonable number.
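
To make the trade-off concrete, here is a small sketch of how the batched URL is put together and how many requests a given batch size implies. The helper names are hypothetical; the endpoint and the `+` separator are the ones used in this notebook.

```python
import math

BASE_URL = 'https://api.ravelry.com/patterns.json?ids='

def build_details_url(ids):
    """Join string ids with '+' to request several patterns in one call."""
    return BASE_URL + '+'.join(ids)

def requests_needed(n_ids, batch_size):
    """Number of API calls needed to cover n_ids at batch_size ids per call."""
    return math.ceil(n_ids / batch_size)

print(build_details_url(['211562', '130787', '605']))
print(requests_needed(30000, 200))
```

At 200 ids per call, 30,000 ids take 150 requests instead of 30,000.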

In [None]:
res = requests.get('https://api.ravelry.com/patterns.json?ids=' + '+'.join(pattid_flat_list[:200]), auth=requests.auth.HTTPBasicAuth(username, password))
res.content

# Plan of attack

1: write chunk of code to return pattern details for a single id. Should be made into a dataframe and pivoted so that information is in a row instead of a column. I was originally looking at melt or wide_to_long, but Mary suggested transpose - much better!

2: expand the code to take multiple ids. After talking to M&M, wrote a function and for loop to iterate through a test list of 10 ids pulled from the flattened/string-ified list. Continued to use transpose to switch rows and columns, and appended each result to a dataframe. I could probably have skipped this step, but I wanted to test the code at this interval so I could fix potential problems without hitting the API too much.

3: chunk the list of pattern ids into groups of 200 to reduce calls on the API. The integers in the list have already been changed to strings so the ids can be joined with '+' as in the API documentation.
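
Why the transpose is needed can be seen without pandas. The sketch assumes, as the code in this notebook implies, that `'patterns'` in the details response is a dict keyed by pattern id. The field names and values here are made up, and `to_rows` is a hypothetical helper.

```python
# Made-up response in the assumed shape: 'patterns' is a dict keyed by
# pattern id, each value holding that pattern's fields.
fake_result = {
    'patterns': {
        '211562': {'name': 'Example A', 'projects_count': 10},
        '130787': {'name': 'Example B', 'projects_count': 20},
    }
}

def to_rows(patterns):
    """Turn {id: {field: value}} into one dict per pattern, id included."""
    return [{'patt_id': patt_id, **fields} for patt_id, fields in patterns.items()]

rows = to_rows(fake_result['patterns'])
for row in rows:
    print(row)
```

Passing a dict of dicts straight to `pd.DataFrame` makes each id a column, which is why the result has to be transposed. Passing a list of row dicts like `rows` gives the one-row-per-pattern layout directly.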


In [None]:
# Part 1: retrieving details for an individual pattern id.

first = requests.get('https://api.ravelry.com/patterns.json?ids=' +
                     str(211562), auth=requests.auth.HTTPBasicAuth(username, password))
first_result = json.loads(first.content)
first_detail_df = pd.DataFrame(first_result['patterns']).T


In [None]:
first_detail_df

In [None]:
# Part 2: retrieving details for several ids in a list.

second_test_id_list = ['211562', '130787', '605', '169260', '29', '124400', '573', '426231', '418518', '195']

def second_get_details(username, password, test_id_list):
    second_details_df = pd.DataFrame()
    for ids in test_id_list:  # loop over the argument, not the global list
        second = requests.get('https://api.ravelry.com/patterns.json?ids=' +
                     ids, auth=requests.auth.HTTPBasicAuth(username, password))
        second_result = json.loads(second.content)
        detail_df = pd.DataFrame(second_result['patterns']).T
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        second_details_df = pd.concat([second_details_df, detail_df])
        print(second_details_df.shape)
    return second_details_df.reset_index().rename(columns = {'index': 'patt_id'})


In [None]:
second_test_df = second_get_details(username, password, second_test_id_list)

In [None]:
# Part 3: divide list of pattern ids into chunks of 200, resulting in 150 lists within the larger list

# How many elements each list should have  
n = 200
   
# using list comprehension  
pattid_chunk_list = [pattid_flat_list[i:i + n] for i in range(0, len(pattid_flat_list), n)]  
len(pattid_chunk_list)


In [None]:
# code to retrieve details for full list of ids, 200 at a time. Try/except added due to a json error in the list
# at index 9

def get_details(username, password, id_list):
    details_df = pd.DataFrame()
    for ids in id_list:  # loop over the argument, not the global list
        res = requests.get('https://api.ravelry.com/patterns.json?ids=' +
                           '+'.join(ids), auth=requests.auth.HTTPBasicAuth(username, password))
        try:
            result = json.loads(res.content)
            detail_df = pd.DataFrame(result['patterns']).T
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            details_df = pd.concat([details_df, detail_df])
            print(details_df.shape)
        except (ValueError, KeyError):
            # json.JSONDecodeError is a ValueError; KeyError covers a response without 'patterns'
            print(res.content)
            continue
    return details_df.reset_index().rename(columns = {'index': 'patt_id'})


In [None]:
patt_details_df = get_details(username, password, pattid_chunk_list)

In [None]:
# to deal with the bad value in the chunk at index 9
# append the result to patt_details_df to get the complete dataset
# code from second step of testing; using index number of problem chunk, pass that in for loop to run on each id
# individually
# incorporate try/except from previous function to skip over the problem row

def get_problem_details(username, password, id_list):
    problem_df = pd.DataFrame()
    for ids in id_list:
        res = requests.get('https://api.ravelry.com/patterns.json?ids=' +
                           ids, auth=requests.auth.HTTPBasicAuth(username, password))
        try:
            result = json.loads(res.content)
            detail_df = pd.DataFrame(result['patterns']).T
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            problem_df = pd.concat([problem_df, detail_df])
            print(problem_df.shape)
        except (ValueError, KeyError):
            # print the raw response so the failing id can be spotted in the output
            print(res.content)
            continue
    return problem_df.reset_index().rename(columns = {'index': 'patt_id'})


In [None]:
bad_json_df = get_problem_details(username, password, pattid_chunk_list[9])
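
The batch-then-individual recovery used here could also be folded into one helper. This is a sketch under assumptions, not the method used in this notebook: `fetch_with_fallback` and `fake_fetch` are hypothetical names, and the fetcher is assumed to raise `ValueError` on an undecodable response (`json.JSONDecodeError` is a subclass of `ValueError`).

```python
def fetch_with_fallback(fetch, chunk):
    """Try the whole chunk; if it fails, retry each id on its own.

    fetch is any callable taking a list of ids and returning a dict keyed
    by id. Returns (combined results, ids that failed individually).
    """
    try:
        return fetch(chunk), []
    except ValueError:
        pass  # fall through to the per-id retry
    results, bad_ids = {}, []
    for one_id in chunk:
        try:
            results.update(fetch([one_id]))
        except ValueError:
            bad_ids.append(one_id)
    return results, bad_ids


# Offline stand-in: id '20' plays the part of the id that breaks decoding.
def fake_fetch(ids):
    if '20' in ids:
        raise ValueError('bad JSON')
    return {i: {'name': 'pattern ' + i} for i in ids}

results, bad_ids = fetch_with_fallback(fake_fetch, ['605', '20', '29'])
print(sorted(results), bad_ids)
```

The cost is one extra request per id in a failing chunk, so a single bad id in a chunk of 200 adds 200 calls.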

In [None]:
# identify 'bad' id (I could see where it was from the print statement in the loop)

problem = pattid_chunk_list[9]
problem[147]

In [None]:
# repurposing code from first test to see if the problem id can be pulled individually
# confirmed that this is definitely the problem id and it cannot be called through API

prob_id = requests.get('https://api.ravelry.com/patterns.json?ids=' +
                     str(20), auth=requests.auth.HTTPBasicAuth(username, password))
prob_id_result = json.loads(prob_id.content)
prob_id_detail_df = pd.DataFrame(prob_id_result['patterns']).T
prob_id_detail_df = prob_id_detail_df.reset_index().rename(columns = {'index': 'patt_id'})

In [None]:
# append bad_json_df to patt_details_df for complete pattern details dataset

# DataFrame.append was removed in pandas 2.0, so use pd.concat
patt_details_df = pd.concat([patt_details_df, bad_json_df]).reset_index()
patt_details_df.info()

In [None]:
patt_details_df.to_csv('../data/df_pattdetails.csv')

# Data on yarns

In [None]:
# see what yarn information looks like

def get_yarn_info(username, password):
    res = requests.get('https://api.ravelry.com/yarns/search.json?sort=projects&page_size=1000&page=20',
                       auth=requests.auth.HTTPBasicAuth(username, password))
    return res

In [None]:
res = get_yarn_info(username, password)
result = json.loads(res.content)
result['yarns']
yarns_df = pd.DataFrame(result['yarns'])
yarns_df.tail()

Most used yarn has 270.3k projects. 1000th yarn has 2525 projects.

2000th has 1277. 3000th - 825. 5000th - 428. 10000th - 189.

I think 10,000 is a good number of popular yarns; that cuts off at roughly 200 projects (the 10,000th yarn has 189), which is reasonable and limits the data to yarn that can reasonably be expected to be produced professionally.

Unlike with the pattern listings, I'm going to get all the yarn data as I can use more than the ID. If the information I can use is duplicated in the yarn details then I won't save the dataframe to a csv.

Plan: use function and for loop to get first 10 pages of yarn listings and save into dataframe. Then split id column out into a list and use it to get yarn details, similar to pattern details.

In [None]:
# get yarn info and put into dataframe

def get_yarn_info(username, password):
    yarninfo_df = pd.DataFrame()
    for page in range(1, 11):
        print(page)
        res = requests.get('https://api.ravelry.com/yarns/search.json?sort=projects&page_size=1000&page=' + str(page),
                           auth=requests.auth.HTTPBasicAuth(username, password))
        result = json.loads(res.content)
        yarns_df = pd.DataFrame(result['yarns'])
        print(yarns_df.shape)
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        yarninfo_df = pd.concat([yarninfo_df, yarns_df])
        print(yarninfo_df.shape)
    return yarninfo_df



In [None]:
yarnlistings_df = get_yarn_info(username, password)

In [None]:
# pull out id column as list

yarnid_list = yarnlistings_df.id.to_list()
yarnid_list = [str(i) for i in yarnid_list]
yarnid_list

In [None]:
# first, test one id to determine shape of results
# same shape as pattern details (details for an id in a column rather than a row) so use transpose as with patterns

firstyarn = requests.get('https://api.ravelry.com/yarns.json?ids=' +
                     str(2059), auth=requests.auth.HTTPBasicAuth(username, password))
firstyarn_result = json.loads(firstyarn.content)
firstyarn_detail_df = pd.DataFrame(firstyarn_result['yarns'])

In [None]:
# break yarnid_list into chunks of 200

# How many elements each list should have  
n = 200
   
# using list comprehension  
yarnid_chunk_list = [yarnid_list[i:i + n] for i in range(0, len(yarnid_list), n)]  
len(yarnid_chunk_list)


In [None]:
# pull yarn details in chunks as with patterns

def get_details(username, password, id_list):
    details_df = pd.DataFrame()
    for ids in id_list:  # loop over the argument, not the global list
        res = requests.get('https://api.ravelry.com/yarns.json?ids=' +
                           '+'.join(ids), auth=requests.auth.HTTPBasicAuth(username, password))
        try:
            result = json.loads(res.content)
            detail_df = pd.DataFrame(result['yarns']).T
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            details_df = pd.concat([details_df, detail_df])
            print(details_df.shape)
        except (ValueError, KeyError):
            # json.JSONDecodeError is a ValueError; KeyError covers a response without 'yarns'
            print(res.content)
            continue
    return details_df.reset_index().rename(columns = {'index': 'yarn_id'})


In [None]:
yarndetails_df = get_details(username, password, yarnid_chunk_list)

In [None]:
# save both yarn dataframes to csvs - will combine in cleaning stage
yarndetails_df.to_csv('../data/df_yarndetails.csv')
yarnlistings_df.to_csv('../data/df_yarnlistings.csv')

# Data on yarn shops

In [None]:
# see what shop information looks like

def get_shop_info(username, password):
    res = requests.get('https://api.ravelry.com/shops/search.json?lat=36.142642&lng=-86.780897&radius=250&units=miles&shop_type_id=1&page_size=1000&page=1',
                       auth=requests.auth.HTTPBasicAuth(username, password))
    return res


In [None]:
res = get_shop_info(username, password)
result = json.loads(res.content)
result['shops']
shops_df = pd.DataFrame(result['shops'])
shops_df.head()

In [None]:
def get_shop_info(username, password):
    res = requests.get('https://api.ravelry.com/shops/search.json?query="Tennessee"&page_size=1000&shop_type_id=1',
                       auth=requests.auth.HTTPBasicAuth(username, password))
    return res


In [None]:
res = get_shop_info(username, password)
result = json.loads(res.content)
result['shops']
shops_df = pd.DataFrame(result['shops'])
shops_df.head()

The geographic search has a radius limit of 250 miles, which would make it awkward to retrieve listings for the whole country, but I can pass state names in as a query and get results. The best strategy is to loop over a list of state names. I tried using "United States" as the query but got back 413 results, which doesn't match the results on the website.
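
Two details of a per-state loop can be sketched ahead of time: encoding state names that contain spaces, and dropping shops that come back under more than one query. The second point is an assumption, not something checked here (a text query for 'Virginia' might also match 'West Virginia'). The helper names are hypothetical and the listings are made up.

```python
from urllib.parse import urlencode

def shop_search_url(state):
    """Build the search URL for one state, encoding the name for the query string."""
    params = {'query': state, 'page_size': 1000, 'shop_type_id': 1}
    return 'https://api.ravelry.com/shops/search.json?' + urlencode(params)

def dedupe_by_id(shops):
    """Keep the first occurrence of each shop id, preserving order."""
    seen, unique = set(), []
    for shop in shops:
        if shop['id'] not in seen:
            seen.add(shop['id'])
            unique.append(shop)
    return unique

print(shop_search_url('New York'))
# made-up listings: shop 2 turns up under two different state queries
print(dedupe_by_id([{'id': 1}, {'id': 2}, {'id': 2}, {'id': 3}]))
```

On a dataframe of listings, `drop_duplicates(subset='id')` would do the same job as `dedupe_by_id`.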

In [None]:
state_list = ['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas', 'California', 'Colorado',
              'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii',
              'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
              'Massachusetts', 'Michigan', 'Minnesota', 'Minor Outlying Islands', 'Mississippi', 'Missouri',
              'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
              'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio', 'Oklahoma', 'Oregon',
              'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
              'Texas', 'U.S. Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia',
              'Wisconsin', 'Wyoming']

In [None]:
# get shop info and put into dataframe

def get_shop_info(username, password):
    shopinfo_df = pd.DataFrame()
    for state in state_list:
        print(state)
        res = requests.get('https://api.ravelry.com/shops/search.json?query=' + state + '&page_size=1000&shop_type_id=1',
                           auth=requests.auth.HTTPBasicAuth(username, password))
        result = json.loads(res.content)
        shops_df = pd.DataFrame(result['shops'])
        print(shops_df.shape)
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        shopinfo_df = pd.concat([shopinfo_df, shops_df])
        print(shopinfo_df.shape)
    return shopinfo_df


In [None]:
shoplisting_df = get_shop_info(username, password)

In [None]:
shoplisting_df.info()

In [None]:
shoplisting_df.head()

In [None]:
# get shop ids into a list

shopid_list = shoplisting_df.id.to_list()


In [None]:
# first, test one id to determine shape of results
# is in row format instead of a single column, but returns 2 rows for a single shop.
# The rows seem to be identical except the row labeled 'id' has a code for country and state, and the one
# labeled 'name' has the country and state name.

firstshop = requests.get('https://api.ravelry.com/shops/' + str(6459) + '.json', 
                         auth=requests.auth.HTTPBasicAuth(username, password))
firstshop_result = json.loads(firstshop.content)
firstshop_detail_df = pd.DataFrame(firstshop_result['shop'])

In [None]:
firstshop_detail_df
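
The two rows per shop would be explained if some fields in the response are nested dicts with `id` and `name` keys. That is an assumption based on the row labels, not confirmed against the API documentation. Under that assumption, flattening the record first gives one row per shop. `flatten_shop` is a hypothetical helper and the record below is made up.

```python
def flatten_shop(shop):
    """Flatten one level of nesting: {'state': {'id': 1, 'name': 'X'}} becomes
    {'state_id': 1, 'state_name': 'X'}; other fields pass through unchanged."""
    flat = {}
    for key, value in shop.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[key + '_' + sub_key] = sub_value
        else:
            flat[key] = value
    return flat

# Made-up shop record; field names and values are illustrative only.
fake_shop = {
    'id': 6459,
    'name': 'Example Yarn Shop',
    'country': {'id': 1, 'name': 'United States'},
    'state': {'id': 2, 'name': 'Tennessee'},
}
print(flatten_shop(fake_shop))
```

Wrapping the flattened dict in a list before passing it to `pd.DataFrame` gives a single row for the shop.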

In [None]:
# can only put one shop id in at a time, so use list as is without chunking it

def get_shop_details(username, password, id_list):
    details_df = pd.DataFrame()
    for ids in id_list:
        res = requests.get('https://api.ravelry.com/shops/' + str(ids) + '.json', 
                           auth=requests.auth.HTTPBasicAuth(username, password))
        result = json.loads(res.content)
        detail_df = pd.DataFrame(result['shop'])
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        details_df = pd.concat([details_df, detail_df])
        print(details_df.shape)
    return details_df.reset_index()


In [None]:
shopdetails_df = get_shop_details(username, password, shopid_list)

In [None]:
# save both shop dataframes to csvs - will combine in cleaning stage
shopdetails_df.to_csv('../data/df_shopdetails.csv')
shoplisting_df.to_csv('../data/df_shoplistings.csv')