This notebook is the second step in the workflow. It assumes you have already created an 'account.env' file as described in the 'Ravelry Pattern Data Gathering' notebook, and it walks through gathering the yarn data for our analysis. As of this writing, a Ravelry Yarn Search filtered by the attribute "hand-dyed" returns <b>21,320</b> results. This notebook pulls the top 1,000 of those, sorted by "most projects," which is roughly 4.7% of the total.

The process closely follows the one used for the pattern data, and it likewise ends by exporting the final dataframe to a .csv file for later import and analysis.

In [1]:
# settings.py
from pathlib import Path
from dotenv import load_dotenv

# Load the Ravelry credentials from 'account.env' in the working directory.
env_path = Path('account.env')
load_dotenv(dotenv_path=env_path)

True

In [2]:
import os
RAVELRY_USERNAME = os.getenv("RAVELRY_USERNAME")
RAVELRY_PASSWORD = os.getenv("RAVELRY_PASSWORD")
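
Because `os.getenv` silently returns `None` when a variable is missing, a typo in 'account.env' would only surface later as a failed request. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your account.env file")
    return value


# Usage (assumes load_dotenv has already run):
# RAVELRY_USERNAME = require_env("RAVELRY_USERNAME")
# RAVELRY_PASSWORD = require_env("RAVELRY_PASSWORD")
```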

In [3]:
import requests
import json
import pandas as pd

In [5]:
url = "https://api.ravelry.com/yarns/search.json"
response = requests.get(url, params={"sort": "projects", "page_size": 1000,
                                    "ya":"hand-dyed"}, auth=(RAVELRY_USERNAME, RAVELRY_PASSWORD))
print(response)

<Response [200]>

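Printing the response only confirms the status by eye. As a sketch of a slightly more defensive version of the same request, the helper below raises on a non-2xx status and sets a timeout. The function name `fetch_top_yarns`, the `get` parameter (there only so the helper can be exercised without network access), and the 30-second timeout are assumptions for illustration; the URL and query parameters are the ones used in the cell above.

```python
RAVELRY_YARN_SEARCH_URL = "https://api.ravelry.com/yarns/search.json"


def fetch_top_yarns(username, password, page_size=1000, get=None):
    """Return the 'yarns' list from a single yarn-search request."""
    if get is None:
        import requests  # imported lazily so a stand-in `get` can be supplied instead
        get = requests.get
    response = get(
        RAVELRY_YARN_SEARCH_URL,
        params={"sort": "projects", "page_size": page_size, "ya": "hand-dyed"},
        auth=(username, password),
        timeout=30,  # assumed value; avoids hanging indefinitely
    )
    response.raise_for_status()  # raise instead of continuing on an HTTP error
    return response.json()["yarns"]


# Usage:
# yarnData = fetch_top_yarns(RAVELRY_USERNAME, RAVELRY_PASSWORD)
```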

In [7]:
yarnData = response.json()
yarnData.keys()

dict_keys(['yarns', 'paginator'])

In [8]:
yarnData = yarnData['yarns']

In [13]:
yarnData[0]

{'discontinued': False,
 'gauge_divisor': 4,
 'grams': 100,
 'id': 1666,
 'machine_washable': None,
 'max_gauge': None,
 'min_gauge': 18.0,
 'name': 'Worsted',
 'permalink': 'malabrigo-yarn-worsted',
 'rating_average': 4.73,
 'rating_count': 19461,
 'rating_total': 92035,
 'texture': 'singles',
 'thread_size': None,
 'wpi': 8,
 'yardage': 210,
 'yarn_company_name': 'Malabrigo Yarn',
 'first_photo': {'id': 17858294,
  'sort_order': 1,
  'x_offset': -34,
  'y_offset': -15,
  'square_url': 'https://images4-g.ravelrycache.com/uploads/jomejo209/15896181/IMG_1120_square.JPG',
  'medium_url': 'https://images4-f.ravelrycache.com/uploads/jomejo209/15896181/IMG_1120_medium.JPG',
  'thumbnail_url': 'https://images4-g.ravelrycache.com/uploads/jomejo209/15896181/IMG_1120_thumbnail.JPG',
  'small_url': 'https://images4-f.ravelrycache.com/uploads/jomejo209/15896181/IMG_1120_small.JPG',
  'medium2_url': 'https://images4-f.ravelrycache.com/uploads/jomejo209/15896181/IMG_1120_medium2.JPG',
  'small2_url

In [33]:
yarnData[0].keys()

dict_keys(['discontinued', 'gauge_divisor', 'grams', 'id', 'machine_washable', 'max_gauge', 'min_gauge', 'name', 'permalink', 'rating_average', 'rating_count', 'rating_total', 'texture', 'thread_size', 'wpi', 'yardage', 'yarn_company_name', 'first_photo', 'personal_attributes', 'yarn_weight'])

These are the fields that we're most interested in gathering for yarn data:

- discontinued
- grams
- id
- machine_washable
- rating_average
- yardage
- yarn_weight > name

Initially, we considered going one level deeper to pull each yarn's fiber content. We ultimately decided against it, because we did not expect to gain meaningful insight from that data with our analysis methods: content percentages vary from yarn to yarn, and so does the number of fibers in a blend, which makes fiber content difficult to pull, analyze, and compare.

Having dropped the fiber data, we can get everything we need from the initial GET request.

At this point we have all the data we need. What remains is to organize it into lists, zip the lists into a dataframe, and export the result to csv.
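
As an alternative to one list per field, the same selection can be sketched as one dictionary per yarn. The helper name `extract_yarn_fields` is hypothetical; the `sample` record is trimmed from the first result shown earlier, with the weight name taken from the first row of the final dataframe. A list of such dictionaries can be passed directly to `pd.DataFrame`.

```python
def extract_yarn_fields(yarn):
    """Pick the fields of interest from one yarn record, using None for anything missing."""
    weight = yarn.get("yarn_weight") or {}  # tolerate a missing or null yarn_weight
    return {
        "Yarn ID": yarn.get("id"),
        "Discontinued": yarn.get("discontinued"),
        "Grams": yarn.get("grams"),
        "Machine Washable": yarn.get("machine_washable"),
        "Average Rating": yarn.get("rating_average"),
        "Yardage": yarn.get("yardage"),
        "Yarn Weight": weight.get("name"),
    }


# Trimmed copy of the first record returned by the search.
sample = {
    "discontinued": False,
    "grams": 100,
    "id": 1666,
    "machine_washable": None,
    "rating_average": 4.73,
    "yardage": 210,
    "yarn_weight": {"name": "Aran"},
}
row = extract_yarn_fields(sample)

# Usage with the full result set:
# df = pd.DataFrame([extract_yarn_fields(yarn) for yarn in yarnData])
```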

In [44]:
discontinuedls = []
gramsls = []
idls = []
machine_washablels = []
rating_averagels = []
yardagels = []
yarn_weightls = []

In [45]:
for yarn in yarnData:
    # Fall back to None whenever a field is missing from a record.
    try: discontinued = yarn['discontinued']
    except KeyError: discontinued = None

    try: grams = yarn['grams']
    except KeyError: grams = None

    try: yid = yarn['id']
    except KeyError: yid = None

    try: machine_washable = yarn['machine_washable']
    except KeyError: machine_washable = None

    try: rating_average = yarn['rating_average']
    except KeyError: rating_average = None

    try: yardage = yarn['yardage']
    except KeyError: yardage = None

    # 'yarn_weight' is a nested dict; TypeError covers the case where it is None.
    try: yarn_weight = yarn['yarn_weight']['name']
    except (KeyError, TypeError): yarn_weight = None
        
    discontinuedls.append(discontinued)
    gramsls.append(grams)
    idls.append(yid)
    machine_washablels.append(machine_washable)
    rating_averagels.append(rating_average)
    yardagels.append(yardage)
    yarn_weightls.append(yarn_weight)

In [46]:
print(len(discontinuedls))
print(len(gramsls))
print(len(idls))
print(len(machine_washablels))
print(len(rating_averagels))
print(len(yardagels))
print(len(yarn_weightls))

1000
1000
1000
1000
1000
1000
1000
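
The equal lengths matter because `zip` stops at the shortest list without warning, so a short list would silently drop rows. A compact check can be sketched as follows; the helper name `all_same_length` is hypothetical.

```python
def all_same_length(*columns):
    """Return True when every sequence passed in has the same length."""
    return len({len(column) for column in columns}) <= 1


# Usage with the lists built above:
# assert all_same_length(idls, discontinuedls, gramsls, machine_washablels,
#                        rating_averagels, yardagels, yarn_weightls)
```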


In [47]:
TopYarnList = list(zip(idls, discontinuedls, gramsls, machine_washablels,
                       rating_averagels, yardagels, yarn_weightls))
colnames = ['Yarn ID', 'Discontinued', 'Grams', 'Machine Washable', 'Average Rating', 'Yardage', 'Yarn Weight']

In [48]:
df = pd.DataFrame(TopYarnList, columns=colnames)

df.head(10)

   Yarn ID  Discontinued  Grams  Machine Washable  Average Rating  Yardage      Yarn Weight
0     1666         False  100.0              None            4.73    210.0             Aran
1    53539         False    NaN              True            4.70    420.0        Fingering
2    26385         False  100.0              True            4.73    440.0  Light Fingering
3     3893         False  150.0              True            4.71    574.0        Fingering
4    24750         False  150.0              True            4.69    510.0        Fingering
5    35367         False    NaN              True            4.78    225.0               DK
6     8053         False    NaN              True            4.71    395.0        Fingering
7    55637         False    NaN              True            4.80    200.0          Worsted
8     3248         False  146.0              True            4.66    405.0        Fingering
9     3847         False   50.0              None            4.52    470.0             Lace


In [49]:
df.isnull().sum()

Yarn ID               0
Discontinued          0
Grams                38
Machine Washable    404
Average Rating        0
Yardage               2
Yarn Weight           3
dtype: int64

In [50]:
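# Treat an unspecified 'Machine Washable' value as False.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
df['Machine Washable'] = df['Machine Washable'].fillna(False)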

In [51]:
df.isnull().sum()

Yarn ID              0
Discontinued         0
Grams               38
Machine Washable     0
Average Rating       0
Yardage              2
Yarn Weight          3
dtype: int64

At this point, the remaining nulls are few (38 in Grams, 2 in Yardage, and 3 in Yarn Weight, out of 1,000 rows) and confined to three columns, so any analysis that depends on one of those columns can simply exclude the affected rows and proceed without issue.
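
As a sketch of what ignoring the nulls could look like in a later analysis, the example below drops only the rows that lack the column a given analysis needs. The three-row `sample` frame is copied from the `df.head(10)` output above; the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the df.head(10) output (only the columns used here).
sample = pd.DataFrame({
    "Yarn ID": [1666, 53539, 26385],
    "Grams": [100.0, None, 100.0],
    "Yardage": [210.0, 420.0, 440.0],
})

# Keep only rows where Grams is known; nulls in other columns are left alone,
# and the original frame is not modified.
with_grams = sample.dropna(subset=["Grams"])
```

Filtering per analysis, rather than dropping every row with any null up front, keeps as many of the 1,000 yarns as possible for analyses that do not touch the incomplete columns.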

In [52]:
df.to_csv('TopRavelryYarnList.csv', index=False, header=True)