# Web-scraping User Profile Information

This project uses data from a community-driven website for traditional Irish music

https://thesession.org/

in which users can submit tunes in '.abc' format.

The website's designer kindly provides portions of the site database in JSON format

https://github.com/adactio/TheSession-data

which includes:

- Tune settings (submitted by users)
- Recordings of tunes (which are cross-referenced against tune settings)
- Aliases of tunes (traditional tunes are referred to by different names)
- Sessions and Events (posted by users)

This forms the initial basis of the SQL-based database defined in thesessionDB.py.

However, the site has additional information that I am interested in using.

Namely, users have the option of providing a location

https://thesession.org/members/97793

which we can then link to their saved tunes

https://thesession.org/members/97793/tunebook

and to the tunes for which they have submitted settings

https://thesession.org/members/97793/tunes

The scraping code is located in scraper.py.

I plan to aggregate this information and link it with the existing thesessionDB.
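
As a rough idea of what that link could look like, here is a minimal sketch using an in-memory SQLite database. The table names `users` and `tunebooks`, their columns, and the inserted rows are all hypothetical placeholders for illustration; the actual schema in thesessionDB.py may differ.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical tables; the real schema in thesessionDB.py may differ.
cur.execute('CREATE TABLE users (user_id INTEGER PRIMARY KEY, '
            'city TEXT, lat REAL, lon REAL)')
cur.execute('CREATE TABLE tunebooks (user_id INTEGER, tune_id INTEGER)')

# Illustrative rows only.
cur.execute('INSERT INTO users VALUES (?, ?, ?, ?)',
            (1, 'Brighton', 50.8325119, -0.118125))
cur.executemany('INSERT INTO tunebooks VALUES (?, ?)', [(1, 322), (1, 812)])

# Join locations to saved tunes: one row per (city, tune) pair.
cur.execute('''SELECT u.city, t.tune_id
               FROM users AS u
               JOIN tunebooks AS t ON u.user_id = t.user_id
               ORDER BY t.tune_id''')
city_tunes = cur.fetchall()
print(city_tunes)
conn.close()
```

Keeping locations in their own table means a user's coordinates are stored once, however many tunes they have saved.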

In [3]:
from __future__ import absolute_import, division, print_function

# scraper.py
from scraper import userProfile, userTunes


print('Testing userProfile:')
#up = userProfile(97793)
up = userProfile(1)

print('\nUser:',up.userName,up.user_id)

print('\nDescription:\n',up.description)

print('\nLocation:\n',up.location)
print(up.locationInfo['city'])
print(up.locationInfo['county'])
print(up.locationInfo['city'])
print(up.locationInfo['state'])

print('\nTesting userProfile:')
#utb = userTunes(97793)
utb = userTunes(1)

print('\nUser\'s settings:')
print(utb.settings)

print('\nUser\'s tunes:')
print(utb.tunes)

print('\nUser\'s sets:')
print(utb.user_sets)


Testing userProfile:
https://thesession.org/members/1

User: Jeremy 1

Description:
 
My name is Jeremy. I’m from the town of Cobh, county Cork but I’m living in Brighton, southern England now.

I play the bouzouki (melody mostly, not backing) and a bit of banjo, mandolin and guitar.

When I’m not noodling on the bouzouki, I make websites: https://clearleft.com

My home on the web is https://adactio.com


Location:
 [50.8325119, -0.118125]
Brighton
Brighton and Hove
Brighton
England

Testing userProfile:
https://thesession.org/members/1/tunebook?page=1
https://thesession.org/members/1/tunes?page=1
https://thesession.org/members/1/sets?page=1
https://thesession.org/members/1/sets/13048
<a href="/tunes/322#setting322">The Green Fields Of Rossbeigh</a>
[322, 322]
<a href="/tunes/812#setting20721">The Virginia</a>
[812, 20721]
https://thesession.org/members/1/sets/12951
<a href="/tunes/1050#setting21229">The Maple Leaf</a>
[1050, 21229]
<a href="/tunes/973#setting973">Man Of Aran</a>
[973,

In [None]:
x = utb.tmplist[0]
x.attrs

In [2]:
utb.user_sets

[{'set_id': u'13048',
  'settings': [322, 20721],
  'tunes': [322, 812],
  'user_id': u'1'},
 {'set_id': u'12951',
  'settings': [21229, 973],
  'tunes': [1050, 973],
  'user_id': u'1'},
 {'set_id': u'12873',
  'settings': [496, 240, 359],
  'tunes': [496, 240, 359],
  'user_id': u'1'},
 {'set_id': u'12815',
  'settings': [1527, 11897],
  'tunes': [1527, 11897],
  'user_id': u'1'},
 {'set_id': u'12805',
  'settings': [68, 69, 73],
  'tunes': [68, 69, 73],
  'user_id': u'1'},
 {'set_id': u'12787',
  'settings': [840, 452],
  'tunes': [840, 452],
  'user_id': u'1'},
 {'set_id': u'12784',
  'settings': [183, 358],
  'tunes': [183, 358],
  'user_id': u'1'},
 {'set_id': u'12750',
  'settings': [51, 54],
  'tunes': [51, 54],
  'user_id': u'1'},
 {'set_id': u'12609',
  'settings': [113, 29202],
  'tunes': [113, 769],
  'user_id': u'1'},
 {'set_id': u'11881',
  'settings': [26, 1359, 19],
  'tunes': [26, 1359, 19],
  'user_id': u'1'}]
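
To load these sets into a relational table, the nested lists probably need flattening into one row per tune. The following is a minimal sketch, assuming the `tunes` and `settings` lists are aligned position by position as they appear above; `flatten_user_sets` is a hypothetical helper and not part of scraper.py.

```python
def flatten_user_sets(user_sets):
    """Return one (user_id, set_id, position, tune_id, setting_id) row per tune."""
    rows = []
    for s in user_sets:
        # Assumes s['tunes'] and s['settings'] are aligned by position.
        pairs = zip(s['tunes'], s['settings'])
        for position, (tune_id, setting_id) in enumerate(pairs):
            rows.append((int(s['user_id']), int(s['set_id']),
                         position, tune_id, setting_id))
    return rows


# First entry of the output above, used as sample input.
sample = [{'set_id': '13048',
           'settings': [322, 20721],
           'tunes': [322, 812],
           'user_id': '1'}]

rows = flatten_user_sets(sample)
print(rows)
```

The `position` column preserves the order of tunes within a set, which would otherwise be lost in an unordered table.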

# Proposed goals for location information
Here are several possible uses for this kind of information.

- Distribution of particular tunes

- Variation in popularity of tune types

- Quantifying how 'traditional' a region's repertoire is (e.g. by comparing it to O'Neill's or a similar collection).

- Looking for similarities / influences between locations' repertoires.

- Generating a list of tunes to try if you are attending a session in a new location (cross-referenced with a user's tunebook).

- Regional connectedness could be included as a feature for a ML-based 
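
Two of the ideas above (repertoire similarity and tunes to try) can be sketched with plain set operations. This is only an illustration: Jaccard similarity is one possible measure among many, and the place names and tune ids in `repertoires` are made-up sample data, not scraped results.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of tune ids (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))


def tunes_to_try(local_repertoire, my_tunebook):
    """Tunes played locally that are not yet in the visitor's tunebook."""
    return sorted(set(local_repertoire) - set(my_tunebook))


# Hypothetical repertoires keyed by place (illustrative ids only).
repertoires = {'Brighton': [322, 812, 1050],
               'Cork': [322, 973, 1050, 27]}

similarity = jaccard(repertoires['Brighton'], repertoires['Cork'])
suggestions = tunes_to_try(repertoires['Cork'], [322, 27])
print(similarity, suggestions)
```

In practice the repertoires would likely need weighting by how many local users list each tune, since a single user's tunebook could otherwise dominate a sparsely populated location.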

# Tune comment information

The site has another source of information about tunes that is not included in the database provided by Jeremy: the user comments on tune pages, which contain a wealth of more specific information. There is a variety of interesting tasks to approach with natural language processing, including:

- Origin of tunes (this could be more specific than the earlier suggestion based on user locations)
- Connection to musicians (particularly those pre-dating modern recordings)
- Identifying traditional versus modern tunes
- Instrument-specific repertoires (e.g. piping tunes)
- More abstract traits of the tune itself? (e.g. tempo, feel)
- Related tunes via references to other tune pages (need to be cautious as people suggest tunes in this way)
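
The last item, finding references to other tune pages, could start from a simple regular expression over the comment text. This is a minimal sketch, assuming comments contain full links of the form `thesession.org/tunes/<id>`; the sample comment is invented and `referenced_tune_ids` is a hypothetical helper.

```python
import re

# Matches links such as https://thesession.org/tunes/55 or .../tunes/55#setting55
TUNE_LINK = re.compile(r'thesession\.org/tunes/(\d+)')


def referenced_tune_ids(comment_text, own_id=None):
    """Return tune ids linked in a comment, in order, without duplicates."""
    ids = []
    for match in TUNE_LINK.finditer(comment_text):
        tune_id = int(match.group(1))
        # Skip links back to the tune the comment is attached to.
        if tune_id != own_id and tune_id not in ids:
            ids.append(tune_id)
    return ids


# Invented comment text, for illustration only.
comment = ('Often played after https://thesession.org/tunes/812 and '
           'sounds a bit like https://thesession.org/tunes/55#setting55.')

linked = referenced_tune_ids(comment, own_id=27)
print(linked)
```

A link alone does not say why the tune was mentioned, so the surrounding sentence would still need inspection before treating two tunes as related.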

In [None]:
# I just remembered that there is a JSON-based API. This is easier than what I did with
# the user information, although it may not include location.

# This might also provide an easier means of finding all of the settings for a given tune.

import json

try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

tid = 27
url = 'https://thesession.org/tunes/' + str(tid) + '?format=json'
print("Downloading tune comment data from %s..." % url)
response = urlopen(url)
data = json.loads(response.read().decode('utf8'))



In [None]:
data.keys()

In [None]:
# compare this to the dumped information
with open('json/tunes.json', 'rb') as f:
    tuneDB = json.loads(f.read().decode('utf8'))
tuneDB[-1].keys()

# New fields are 'comments', 'recordings' and 'tunebooks'.
# The "played with" information is not included (it would have to be scraped).

In [None]:
# This is only summed data, but it could be used as a generic popularity feature
print(data['recordings'])
print(data['tunebooks'])

In [None]:
print(data['comments'])

In [None]:
# sample terms to search for
# would also need a good way to handle alternative names (AKAs)
{'instruments':['fiddle','flute','pipe','concertina','accordion','whistle','harp','bouzouki','mandolin'],
'musicians':['Kevin Burke','Michael Coleman','Joe Cooley',"O'Carolan",'John Doherty'],
'groups':['Altan','Bothy Band','Chieftains'],
'locations':['Antrim','Armagh','Carlow','Cavan','Clare','Cork','Donegal',
             'Down','Fermanagh','Galway','Kerry','Kildare','Kilkenny','Laois',
             'Leitrim','Limerick','Londonderry','Derry','Longford','Louth',
             'Mayo','Meath','Monaghan','Offaly','Roscommon','Sligo','Tipperary',
             'Tyrone','Waterford','Wexford','Wicklow','Dublin','London','Chicago',
             'New York','San Francisco','Boston','Shetland','Cape Breton'],
'history':['traditional','trad','modern','contemporary','composed']}
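
As a starting point for searching comments for these terms, the sketch below counts whole-word, case-insensitive matches. It assumes the comment texts have already been extracted into a plain list of strings (the exact structure of `data['comments']` is not shown here); the sample comments are invented.

```python
import re
from collections import Counter


def count_terms(comments, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        for text in comments:
            counts[term] += len(pattern.findall(text))
    return counts


# Invented comment texts, for illustration only.
comments = ['A great fiddle tune from Clare.',
            'I learned it from a Clare flute player, not a fiddler.']

counts = count_terms(comments, ['fiddle', 'flute', 'Clare', 'pipe'])
print(dict(counts))
```

Whole-word matching is deliberately strict: it will not count 'fiddler' as 'fiddle' or 'pipes' as 'pipe', so plurals and derived forms would need to be added to the term lists or handled with stemming.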

In [None]:

# # URL = Uniform Resource Locator
# try:
#     # For Python 3.0 and later
#     from urllib.request import urlopen
# except ImportError:
#     # Fall back to Python 2's urllib2
#     from urllib2 import urlopen
    
# from bs4 import BeautifulSoup
# import re
# from geopy.geocoders import Nominatim
# import numpy as np

# sample urls
profile_url = 'https://thesession.org/members/97793'
tunebook_url = 'https://thesession.org/members/97793/tunebook'
usersettings_url = 'https://thesession.org/members/97793/tunes'
tune_url = 'https://thesession.org/tunes/55'

In [None]:
# Looking at different user profiles to see what sort of classes
# of locations might be useful to look for

from geopy.geocoders import Nominatim
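# recent geopy versions require an explicit user_agent; the name here is a placeholder
geolocator = Nominatim(user_agent='thesession-scraper')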

# some locations have alternatives to city/county/state

up = userProfile(1659)
location = geolocator.reverse(up.location)
print(location.raw)

print(location.raw['address']['town'])
print(location.raw['address']['county'])
print(location.raw['address']['state'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])

    
up = userProfile(13449)
location = geolocator.reverse(up.location)
print(location.raw)

# print(location.raw['address']['city'])  # 'city' is commented out here; 'locality' is printed in its place
print(location.raw['address']['locality'])
print(location.raw['address']['county'])
print(location.raw['address']['state_district'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])

up = userProfile(10078)
location = geolocator.reverse(up.location)
print(location.raw)

# print(location.raw['address']['city'])  # 'city' is commented out here; 'village' is printed in its place
print(location.raw['address']['village'])
print(location.raw['address']['county'])
print(location.raw['address']['state'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])
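
Since the address keys vary between profiles (`town`, `locality` and `village` all appear above in place of `city`, and `state_district` in place of `state`), a small normalising helper avoids a `KeyError` for each variant. This is a minimal sketch: the key lists only cover the variants seen in this notebook, the sample address is invented, and `summarize_address` is a hypothetical helper.

```python
# Alternatives observed so far, in order of preference.
PLACE_KEYS = ('city', 'town', 'village', 'locality')
REGION_KEYS = ('state', 'state_district')


def first_present(address, keys, default=None):
    """Return the value of the first key found in the address dict."""
    for key in keys:
        if key in address:
            return address[key]
    return default


def summarize_address(address):
    """Reduce a reverse-geocoded address dict to a fixed set of fields."""
    return {
        'place': first_present(address, PLACE_KEYS),
        'county': address.get('county'),
        'region': first_present(address, REGION_KEYS),
        'country': address.get('country'),
        'country_code': address.get('country_code'),
    }


# Invented address in the shape of location.raw['address'].
sample_address = {'village': 'Doolin', 'county': 'County Clare',
                  'country': 'Ireland', 'country_code': 'ie'}

summary = summarize_address(sample_address)
print(summary)
```

With a real lookup this would be called as `summarize_address(location.raw['address'])`; missing fields come back as `None` rather than raising.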
