# <center>An Object-Oriented Approach to Web Scraping</center>
### Here we scrape friends' photos from Facebook.
### Some of the methods and classes are generic and can be reused to scrape any site with minor changes.
p.s. This project was developed purely for <b>learning purposes</b> and is intended to be used only for that.
<br>p.p.s. <b>Avoid running this on Google Colab</b>: probably because of IP restrictions or proxies, you won't be able to scrape from there. But yeah, you may try. Either you'll end up banging your head on the wall or you'll enlighten us all with a solution! :P
### Happy <strike>Scraping</strike> Learning! 
## Do read up on the legal risks involved in scraping, and do it at your own risk

In [581]:
#importing required libraries
import requests
from bs4 import BeautifulSoup
import pickle
import os
import pandas as pd
import shutil
import re

## Creating a common class for all scraping functions
### It has the below methods
<ul>
    <li> <b>Get Session</b>: Creates and returns a new session. Also sets a User-Agent header so that the request doesn't look like it is coming from a bot</li>
    <li> <b>Get Existing Session Cookies</b>: Checks whether a cookie file already exists at the given path and, if so, loads the cookies so the old session can be reused instead of creating a new one</li>
    <li> <b>Save Cookie</b>: Saves the cookies at the given path so that the session can reuse them instead of logging in every time</li>
    <li> <b>Delete Session:</b> Deletes the associated cookies and the session</li>
    <li> <b>Make Request:</b> Makes a GET/POST request. Returns <strike>yummy</strike> beautiful soup (:P) if asked for!</li>
    <li> <b>Download Photos:</b> Makes a GET request to get the image stream and downloads it to the specified path</li>
</ul>
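
The cookie save/load idea from the list above can be tried in isolation. The snippet below is only a minimal sketch: the standalone function names `save_cookies` and `load_cookies`, the demo file name, and the plain dict standing in for `session.cookies` are all assumptions made for illustration, not the class methods themselves.

```python
import os
import pickle
import tempfile


def save_cookies(cookies_path, cookies):
    """Serialize the cookies object to disk; the with-block closes the file."""
    with open(cookies_path, "wb") as f:
        pickle.dump(cookies, f)


def load_cookies(cookies_path):
    """Return the stored cookies, or None when no cookie file exists yet."""
    if not os.path.exists(cookies_path):
        return None
    with open(cookies_path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for session.cookies (assumption for illustration).
demo_path = os.path.join(tempfile.mkdtemp(), "session_demo.cki")
missing = load_cookies(demo_path)              # no file yet, so a login would be needed
save_cookies(demo_path, {"c_user": "12345"})   # made-up cookie value
restored = load_cookies(demo_path)             # same content comes back from disk
```

Keep in mind that `pickle.load` can execute arbitrary code from a crafted file, so only load cookie files that you wrote yourself.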

In [682]:
class UtilMethods:
    
    def get_session(self):
        # Setting a User-Agent is very important; otherwise the endpoint can recognise that it's a bot!
        headers = {  # This is the important part: Nokia C3 User Agent
            'User-Agent': 'NokiaC3-00/5.0 (07.20) Profile/MIDP-2.1 Configuration/CLDC-1.1 Mozilla/5.0 AppleWebKit/420+ (KHTML, like Gecko) Safari/420+'
        }
        session = requests.session()  # Create the session for the next requests
        session.headers.update(headers)
        return session, headers
    
    # If no cookie file exists, return None so that the caller performs the login request to Facebook;
    # otherwise load the stored cookies to keep using the older session.
    def get_existing_session_cookies(self, cookies_path):
        if not os.path.exists(cookies_path):
            return None
        print("Session exists! No need to relogin!")
        
        # Python object serialization:
        # "pickling" converts a Python object hierarchy into a byte stream; pickle.load reverses it.
        # The with-block closes the file even if loading fails.
        with open(cookies_path, 'rb') as f:
            cookies = pickle.load(f)
        return cookies
    
    def save_cookies(self, cookies_path, cookies):
        # The with-block flushes and closes the file once the cookies are written
        with open(cookies_path, 'wb') as f:
            pickle.dump(cookies, f)
        
    def delete_session(self, cookies_path, session):
        if os.path.exists(cookies_path):
            print("Session exists! Need to delete!")
            session.cookies.clear()
            os.remove(cookies_path)
            if not os.path.exists(cookies_path):
                print("Cookie deleted successfully")
        else:
            print("No cookie present")
    
    # Utility function to make the requests and convert to soup object if necessary
    def make_request(self, url, session, headers, method='GET', data=None, is_soup=True, stream = False):
        # Reject both an empty string and None
        if not url:
            raise Exception('Empty Url')
        if method == 'GET':
            resp = session.get(url, headers=headers, stream = stream)
        elif method == 'POST':
            resp = session.post(url, headers=headers, data=data)
        else:
            raise Exception(f'Method [{method}] Not Supported')

        if resp.status_code != 200:
            raise Exception(f'Error [{resp.status_code}] > {url}')
        
        if is_soup:
            return BeautifulSoup(resp.text, 'lxml')
        return resp
    
    def download_photos(self, file_name, path, url, session, headers):
        try:
            # Create the target folder if it doesn't exist yet
            os.makedirs(path, exist_ok=True)
            image_response = self.make_request(url, session, headers, stream = True, is_soup = False)
            with open(os.path.join(path, file_name + "_img.jpg"), "wb") as out_file:
                shutil.copyfileobj(image_response.raw, out_file)
                print('File saved successfully!! Cheers! :D')
        except Exception:
            # make_request raises on an empty url or a non-200 response; requests raises on network errors
            print('Uh Uh! Please check the url/ user id/ your network connection and try again later')
    

## Creating a class for logging into the site to scrape
### It has the below methods
<ul>
    <li> <b>Constructor</b>: Takes the username and password (but never saves them! It only uses them for the login call). The constructor is also responsible for getting a new session, or an existing one if a cookie is already present.</li>
    <li> <b>Get Login Form Data</b>: Creates the form data needed to log in</li>
    <li> <b>Logout</b>: Calls the utility method that clears the cookies and deletes the session.</li>
    <li> Also has a class variable, <b>Util</b>, which holds an instance of <b>Utility Methods</b></li>
</ul>
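
To make the "Get Login Form Data" step concrete, here is a rough sketch of the idea: read the token fields out of the login page and merge them with the credentials. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs. The class `HiddenInputCollector`, the function `build_login_payload`, the sample HTML, and the assumption that the tokens live in `type="hidden"` inputs are all illustrative, not taken from the real page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and "name" in attr:
            # A missing value attribute becomes an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(html, username, password):
    """Merge the hidden form tokens with the credentials into one payload dict."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload["email"] = username
    payload["pass"] = password
    return payload


# Made-up form markup, only shaped like a login page
sample_form = """
<form>
  <input type="hidden" name="lsd" value="AVabc">
  <input type="hidden" name="jazoest" value="2987">
  <input type="text" name="email">
</form>
"""
payload = build_login_payload(sample_form, "user@example.com", "secret")
```

Collecting every hidden field generically, instead of naming each token, is one possible design choice; the class in this notebook looks the fields up one by one by name.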

In [683]:
class LoginToFacebook:
    session = None
    base_url = "https://m.facebook.com"
    util = UtilMethods()
    def __init__(self, username, password, create_new_session=False):
        
        self.session, self.headers = self.util.get_session()
        self.cookies_path = 'session_facebook.cki'  # Give a name to store the session in a cookie file.
        if create_new_session:
            self.login(username, password)
        else:
            cookies = self.util.get_existing_session_cookies(self.cookies_path)
            if cookies is None:
                self.login(username, password)
            else:
                self.session.cookies = cookies
        
        # We should not keep sensitive information like the password in an instance or class variable,
        # whether public or private. Hence we only use it to log in and get the session.
        # Note: del only drops these local names; the caller's own references are unaffected.
        del username
        del password
        
        # At a certain point we need to find the link text that points to a post's url. My Facebook is in
        # English, which is why it says 'Full Story'; change this for your language.
        # Some translations:
        # - English: 'Full Story'
        self.post_url_text = 'Full Story'
        self.posts = []  # Store the scraped posts

    # The first time we login
    def login(self, username, password):
        print("Session doesn't exist! Relogging to facebook!")
        # Get the content of HTML of mobile Login Facebook page
        soup = self.util.make_request(self.base_url, self.session, self.headers)
        if soup is None:
            raise Exception("Couldn't load the Login Page")
        
        # This is the url to send the login params to Facebook
        url_login = "https://m.facebook.com/login/device-based/regular/login/?refsrc=https%3A%2F%2Fm.facebook.com%2F&lwv=100&refid=8"
        payload = self.get_login_form_data(soup, username, password)
        soup = self.util.make_request(url_login, self.session, self.headers, method='POST', data=payload, is_soup=True)
        if soup is None:
            raise Exception(f"The login request couldn't be made: {url_login}")
        
        redirect = soup.select_one('a')  # The first anchor tag is used for redirecting
        if not redirect:
            raise Exception("Please log in desktop/mobile Facebook and change your password")
        url_redirect = redirect.get('href', '')
        resp = self.util.make_request(url_redirect, self.session, self.headers)
        if resp is None:
            raise Exception(f"The login request couldn't be made: {url_redirect}")

        # Finally we get the cookies from the session and save it in a file for future usage
        self.util.save_cookies(self.cookies_path, self.session.cookies)
        print('Logged in successfully')
    
    #function to prepare login form data
    def get_login_form_data(self, soup, username, password):
        '''Here we need to extract these tokens from the login page.
        These are the values used by the login form.
        If we do not send them along with the credentials, the request could be detected as suspicious activity
        and the account might get blocked too!'''
        
        # You can find these values by browsing the website, opening "Inspect Element" and looking for the corresponding tags
        lsd = soup.find("input", {"name": "lsd"}).get("value")
        jazoest = soup.find("input", {"name": "jazoest"}).get("value")
        m_ts = soup.find("input", {"name": "m_ts"}).get("value")
        li = soup.find("input", {"name": "li"}).get("value")
        try_number = soup.find("input", {"name": "try_number"}).get("value")
        unrecognized_tries = soup.find("input", {"name": "unrecognized_tries"}).get("value")
        payload = {
            "lsd": lsd,
            "jazoest": jazoest,
            "m_ts": m_ts,
            "li": li,
            "try_number": try_number,
            "unrecognized_tries": unrecognized_tries,
            "email": username,
            "pass": password,
            "login": "start session",
            "prefill_contact_point": "",
            "prefill_source": "",
            "prefill_type": "",
            "first_prefill_source": "",
            "first_prefill_type": "",
            "had_cp_prefilled": "false",
            "had_password_prefilled": "false",
            "is_smart_lock": "false",
            "_fb_noscript": "true"
        }
        return payload
    
    def logout(self):
        self.util.delete_session(self.cookies_path, self.session)

## Creating a class for getting the profile to scrape
### A Child class for Login
### It has the below methods
<ul>
    <li> <b>Constructor</b>: Takes the username and password and passes them to the super class.</li>
    <li> <b>Add Base Url</b>: Prepends the base site url to a path</li>
    <li> <b>Create Profile Url from Scratch</b>: If this class is given a user name instead of a url, this method generates the url</li>
    <li> <b>Get Username From Url:</b> If this class is given a url instead of a user name, this method extracts the username</li>
    <li> <b>Prepare Profile Url:</b> Switches the url to the mobile site and appends the timeline parameter to the profile url</li>
    <li> <b>Get Profile:</b> Gets the requested profile's beautiful soup, internally calling the other methods to build the url</li>
</ul>
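
The url handling described above can also be sketched with `urllib.parse` rather than string replacement and a regex. This is an alternative illustration, not the approach the class takes; the function names `username_from_url` and `timeline_url` and the example profile names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_HOST = "m.facebook.com"  # mobile site, same host as base_url in the notebook


def username_from_url(url_profile):
    """Return the path part of a profile url without slashes, or None if empty."""
    path = urlsplit(url_profile).path.strip("/")
    return path or None


def timeline_url(url_profile):
    """Point the url at the mobile site and make sure v=timeline is in the query."""
    parts = urlsplit(url_profile)
    query = dict(parse_qsl(parts.query))
    query["v"] = "timeline"
    host = BASE_HOST if parts.netloc == "www.facebook.com" else parts.netloc
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))


name = username_from_url("https://www.facebook.com/some.user/")
url_a = timeline_url("https://www.facebook.com/some.user")
url_b = timeline_url("https://m.facebook.com/profile.php?id=42")
```

Parsing the url first means an existing query string is kept intact and `v=timeline` is appended with the right separator.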

In [677]:
class ProfileToScrape(LoginToFacebook):
    
    def __init__(self, username, password, create_new_session = False):
        super().__init__(username, password, create_new_session)
    
    def add_base_url(self, data):
        url_suffix = str(data)
        if url_suffix.startswith('/'):
            return self.base_url + url_suffix
        else:
            return self.base_url + '/' + url_suffix
        
    def create_profile_url_from_scratch(self, user_id):
        # Prepare the Url to point to the posts feed
        return self.add_base_url(user_id) + '?v=timeline'
    
    def get_username_from_url(self, url_profile):
        username = None
        if url_profile[-1] == '/' or url_profile[-1] == '?':
            url_profile = url_profile[:-1]
        username = re.findall('com/(.+)', url_profile)
        return username
    
    def prepare_profile_url(self, url_profile):
        # Prepare the Url to point to the posts feed
        if "www." in url_profile: url_profile = url_profile.replace('www.', 'm.')
        if 'v=timeline' not in url_profile:
            if '?' in url_profile:
                url_profile = f'{url_profile}&v=timeline'
            else:
                url_profile = f'{url_profile}?v=timeline'
        return url_profile
    
    def get_profile(self, user_id, is_url = False):
        usr_profile = ''
        soup = None
        is_profile_a_group = False
        username = None
        if is_url:
            username = self.get_username_from_url(user_id)
            if username is None or len(username) == 0:
                raise Exception(f"User not found: {user_id}")
            else:
                username = username[0]
            usr_profile = self.prepare_profile_url(user_id)
            is_profile_a_group = '/groups/' in usr_profile
        else:
            usr_profile = self.create_profile_url_from_scratch(user_id)
            username = user_id
            
        # Make a simple GET request
        return self.util.make_request(usr_profile, self.session, self.headers), is_profile_a_group, username

## Creating a class for taking the scraping actions on profile
### A Child class for Profile
### It has the below methods
<ul>
    <li> <b>Constructor</b>: Takes the username and password and passes them to the super class.</li>
    <li> <b>Download Multiple Users Photos</b>: Takes a list of user ids/ user urls whose photos should be scraped, and iteratively calls the method for scraping a single profile</li>
    <li> <b>Download User Photos</b>: Gets the profile from the parent class and uses beautiful soup to build a set (for uniqueness) of urls of the user's photos</li>
    <li> <b>Get Image Url From Soup:</b> Gets the image urls from the soup and passes them to the utility download photos method, which saves the images to your local system</li>
    <li>Hey wait, here's a bonus parameter too!</li>
    <li><b>Restricting number of images per user:</b> Since social network sites are full of images, we can cap the number of images downloaded per user.</li>
    </ul>
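
The two ideas in this list, following "see more" pages while collecting unique photo links and capping the number of images per user, can be sketched without any network access. Everything below is an assumption made for illustration: pages are plain dicts with a `photos` list and a `next` key, `fetch_page` is any callable that returns the next page, and the paths are made up.

```python
from itertools import islice


def collect_photo_links(first_page, fetch_page):
    """Follow 'next' links until none remain, gathering unique photo urls in a set."""
    seen = set()
    page = first_page
    while page is not None:
        seen.update(page["photos"])
        next_key = page.get("next")
        page = fetch_page(next_key) if next_key else None
    return seen


def limit(urls, max_len=None):
    """Return at most max_len urls in sorted order; None means no cap."""
    return list(islice(sorted(urls), max_len))


# Fake pages standing in for the parsed album pages
pages = {"p2": {"photos": ["/photo/3", "/photo/1"], "next": None}}
first = {"photos": ["/photo/1", "/photo/2"], "next": "p2"}

links = collect_photo_links(first, pages.get)
capped = limit(links, 2)
uncapped = limit(links)
```

Sorting before slicing makes the cap deterministic, since iterating over a set directly gives no guaranteed order.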

In [689]:
class ScrapeFromFacebook(ProfileToScrape):
    
    def __init__(self, username, password, create_new_session = False):
        super().__init__(username, password, create_new_session)
    
    def download_multiple_users_photos(self, url_profiles, is_url = False, max_len=None):
        for url_profile in url_profiles:
            self.download_user_photos(url_profile, is_url = is_url, max_len = max_len)
        
    def download_user_photos(self, url_profile, is_url = False, max_len = None):
        soup, is_group, username = self.get_profile(url_profile, is_url)
        list_photo_urls = set()
        if soup is None:
            print(f"Couldn't load the Page: {url_profile}")
            return []
        
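        # find() returns None when the profile has no photos link, so check before reading 'href'
        photos_link = soup.find('a', {'href': re.compile(r'\bphotos\?lst\b')})
        photos_url = photos_link['href'] if photos_link is not None else None
        
        if photos_url is not None: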
            photo_soup = self.util.make_request(self.add_base_url(photos_url), self.session, self.headers, is_soup=True)
            
            # contains link to tagged photos and user's own photos
            albums_link = photo_soup.findAll('a',text='See All')
            if len(albums_link)>0:
                for i in range(len(albums_link)):
                    soup = self.util.make_request(self.add_base_url((albums_link[i])['href']), self.session, self.headers)
                    list_photo_urls.update([self.add_base_url((photo_id)['href']) for photo_id in soup.findAll('a',{'href': re.compile(r'\bphoto\.php\?fbid\b')})])
                    has_see_more = soup.findAll('span',text='See more photos')
                    while has_see_more is not None and len(has_see_more) > 0:
                        a_href = ((has_see_more[0]).find('a'))['href']
                        soup = self.util.make_request(self.add_base_url(a_href), self.session, self.headers, is_soup=True)
                        list_photo_urls.update([self.add_base_url((photo_id)['href']) for photo_id in soup.findAll('a',{'href': re.compile(r'\bphoto\.php\?fbid\b')})])
                        has_see_more = soup.findAll('span',text='See more photos')    
                
            else:
                #no 'See All', probably user has just 1 photo
                list_photo_urls.update([self.add_base_url((photo_id)['href']) for photo_id in photo_soup.findAll('a',{'href': re.compile(r'\bphoto\.php\?fbid\b')})])
        
        if len(list_photo_urls) == 0:
            print('No photos available for '+username)
        else:
            print('Total number of photos for '+username,len(list_photo_urls))
            self.get_image_url_from_soup(list_photo_urls, username, max_len)
    
    def get_image_url_from_soup(self, urls, username, max_len = None):
        for idx, url in enumerate(urls):
            if max_len is not None and idx == max_len:
                break
            soup = self.util.make_request(url, self.session, self.headers)
            root_div = soup.find("div", {"id": "root"})
            img = root_div.find("img")
            self.util.download_photos('img'+str(idx), username, img['src'], self.session, self.headers)

In [690]:
fb = ScrapeFromFacebook('username', 'password')

Session doesn't exist! Relogging to facebook!
Logged in successfully


In [691]:
#downloading photos for single profile
fb.download_user_photos('heymsakshi')

Total number of photos for rushil.sehgal 28
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D


In [692]:
#downloading photos for multiple urls
fb.download_multiple_users_photos(['https://www.facebook.com/neetukalwan','https://www.facebook.com/rushil.sehgal'], is_url=True, max_len=25)

Total number of photos for neetukalwan 261
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
Total number of photos for rushi

In [650]:
a = fb.download_user_photos('rushil.sehgal')

Total number of photos for User 28
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
File saved successfully!! Cheers! :D
Fil

In [680]:
# Don't forget to log out when done, so that you don't end up giving someone else unwanted access to your profile
fb.logout()

Session exists! Need to delete!
Cookie deleted successfully
