<h1 style="text-align:center;font-size: 4em"> Scrape Wikipedia Articles </h1>

## Importing the packages

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import random

## Create a function to clean up article text

In [2]:
import re

def cleaning(text):
    # replace digits and angle brackets with a space
    text = re.sub("[0-9><]+", " ", text)
    # replace newlines with a space
    text = re.sub(r"\n+", " ", text)
    # replace multiple whitespace characters with one space
    text = re.sub(r"\s+", " ", text)
    # convert text to lowercase
    text = text.lower()
    # remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    return text
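
Because `cleaning` collapses whitespace before it strips punctuation, removed characters can leave double spaces behind, which is visible in some of the scraped text. The following is a minimal sketch of a variant that applies the same steps but normalizes whitespace last. The name `clean_text` and the sample sentence are hypothetical and not part of the original notebook.

```python
import re

def clean_text(text):
    """Hypothetical variant of cleaning(): same steps, whitespace collapsed last."""
    # replace digits and angle brackets with a space
    text = re.sub(r"[0-9<>]+", " ", text)
    # convert to lowercase
    text = text.lower()
    # drop punctuation (anything that is neither a word character nor whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # collapse newlines and runs of spaces, then trim both ends
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample sentence, used only for illustration
sample = "Rent rose 12% in 2008,\nsaid YouGov."
print(clean_text(sample))
```

Whether single spacing matters depends on what the text is used for afterwards; a tokenizer that splits on whitespace would treat both versions the same.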

## The main function for scraping the articles from Wikipedia

In [3]:
def scrapArticle(url):
    """Scrape the text of randomly chosen articles linked from a Wikipedia page."""

    number_of_articles = 20  # set the number of articles
    frame = []

    # Fetch and parse the starting page once; its links are the same for every pick
    r1 = requests.get(url)   # get the HTML content
    coverpage = r1.content   # coverpage variable contains the HTML content

    # create a soup in order to allow BeautifulSoup to work
    soup1 = BeautifulSoup(coverpage, 'html5lib')

    # collect all anchors in bodyContent that actually carry an href attribute
    allLinks = soup1.find(id="bodyContent").find_all("a", href=True)

    for i in range(number_of_articles):

        random.shuffle(allLinks)

        scrapedlink = None
        for link in allLinks:
            href = link['href']
            # We are only interested in wiki articles
            if href.find("/wiki/") == -1:
                continue

            # Skip category pages and absolute (external) URLs
            if href.find("Category:") != -1 or href.find("http") != -1:
                continue

            scrapedlink = link  # Use this link to scrape
            break

        # Stop if the page has no usable link at all
        if scrapedlink is None:
            break

        title = scrapedlink.get_text()   # get the title of the link
        FinalLink = "https://en.wikipedia.org" + scrapedlink['href']
        print(FinalLink)

        # Reading the content of the article
        article = requests.get(FinalLink)
        soup_article = BeautifulSoup(article.content, 'html5lib')
        body = soup_article.find_all('div', class_='mw-parser-output')
        if not body:
            continue  # page without the expected content container
        x = body[0].find_all('p')  # because articles are divided into paragraphs

        # clean every paragraph and join them into one string
        list_paragraphs = [cleaning(p.get_text()) for p in x]
        final_article = " ".join(list_paragraphs)

        frame.append([title, FinalLink, final_article])  # put all in a frame
    return frame
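
The filter in `scrapArticle` only rejects hrefs containing `Category:` or `http`, so pages such as `Help:Categories`, `Portal:Sports` and `Wikipedia:FAQ/Categorization` still get picked. A stricter check could reject every namespaced page. The sketch below is one way to do that; `is_article_href` is a hypothetical helper, and it assumes that a colon in the page name always marks a namespace, which also excludes the few real articles whose titles contain a colon.

```python
def is_article_href(href):
    """Hypothetical helper: True if href looks like a plain /wiki/ article link."""
    # Reject missing hrefs, external URLs and anything outside /wiki/
    if not href or not href.startswith("/wiki/"):
        return False
    # Drop the leading "/wiki/" and any "#section" fragment
    page = href[len("/wiki/"):].split("#", 1)[0]
    # Assumption: a colon means a namespace such as Help:, Portal: or Category:
    return bool(page) and ":" not in page

for href in ["/wiki/Renting", "/wiki/Help:Categories", "/wiki/Portal:Sports"]:
    print(href, is_article_href(href))
```

Inside the scraping loop, a check like this could stand in for the two `find` tests.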

## Testing using a link

In [4]:
# Link:
url = 'https://en.wikipedia.org/wiki/Category:Finance'

In [5]:
frame = scrapArticle(url)

https://en.wikipedia.org/wiki/Renting
https://en.wikipedia.org/wiki/Help:Categories
https://en.wikipedia.org/wiki/Non-financial_asset
https://en.wikipedia.org/wiki/Master_of_Applied_Finance
https://en.wikipedia.org/wiki/Finance
https://en.wikipedia.org/wiki/Capital_Markets_Union
https://en.wikipedia.org/wiki/Non-financial_asset
https://en.wikipedia.org/wiki/Capital_Markets_Union
https://en.wikipedia.org/wiki/P2F
https://en.wikipedia.org/wiki/Master_of_Applied_Finance
https://en.wikipedia.org/wiki/Designated_Professional_Body
https://en.wikipedia.org/wiki/Financial_stability
https://en.wikipedia.org/wiki/Real_bills_doctrine
https://en.wikipedia.org/wiki/Asset
https://en.wikipedia.org/wiki/Brattle_Group
https://en.wikipedia.org/wiki/P2F
https://en.wikipedia.org/wiki/Help:Categories
https://en.wikipedia.org/wiki/Trade_exchange
https://en.wikipedia.org/wiki/Shadow_Banking_in_China
https://en.wikipedia.org/wiki/Help:Category


In [6]:
data = pd.DataFrame(frame, columns=['title', 'link', 'article content'])
data

| | title | link | article content |
|---|---|---|---|
| 0 | Renting | https://en.wikipedia.org/wiki/Renting | renting also known as hiring or letting is an ... |
| 1 | category | https://en.wikipedia.org/wiki/Help:Categories | categories are used in wikipedia  to link art... |
| 2 | Non-financial asset | https://en.wikipedia.org/wiki/Non-financial_asset | a nonfinancial asset is an asset that cannot b... |
| 3 | Master of Applied Finance | https://en.wikipedia.org/wiki/Master_of_Applie... | the master of finance is a masters degree aw... |
| 4 | Finance | https://en.wikipedia.org/wiki/Finance | finance is a term for matters regarding the ma... |
| 5 | Capital Markets Union | https://en.wikipedia.org/wiki/Capital_Markets_... | the capital markets union cmu is an ec... |
| 6 | Non-financial asset | https://en.wikipedia.org/wiki/Non-financial_asset | a nonfinancial asset is an asset that cannot b... |
| 7 | Capital Markets Union | https://en.wikipedia.org/wiki/Capital_Markets_... | the capital markets union cmu is an ec... |
| 8 | P2F | https://en.wikipedia.org/wiki/P2F | p f chinese 个人对金融机构 also known as p f model ... |
| 9 | Master of Applied Finance | https://en.wikipedia.org/wiki/Master_of_Applie... | the master of finance is a masters degree aw... |
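
The output above lists some articles more than once, because every pick is an independent random draw from the same set of links. If repeated rows are unwanted, they can be dropped before building the DataFrame. The following is a minimal sketch; `drop_duplicate_links` is a hypothetical helper and `sample_frame` holds made-up placeholder rows in the same `[title, link, article content]` layout.

```python
def drop_duplicate_links(frame):
    """Keep the first [title, link, content] row for each distinct link."""
    seen = set()
    unique = []
    for title, link, content in frame:
        if link in seen:
            continue  # this article was already collected
        seen.add(link)
        unique.append([title, link, content])
    return unique

# Placeholder rows for illustration only
sample_frame = [
    ["Renting", "https://en.wikipedia.org/wiki/Renting", "renting also known as ..."],
    ["P2F", "https://en.wikipedia.org/wiki/P2F", "p f chinese ..."],
    ["P2F", "https://en.wikipedia.org/wiki/P2F", "p f chinese ..."],
]
print(len(drop_duplicate_links(sample_frame)))
```

On an existing DataFrame, `data.drop_duplicates(subset='link')` should give a comparable result.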


## Example of article content

In [7]:
frame[0][2]

'renting also known as hiring or letting is an agreement where a payment is made for the temporary use of a good service or property owned by another a gross lease is when the tenant pays a flat rental amount and the landlord pays for all property charges regularly incurred by the ownership an example of renting is equipment rental renting can be an example of the sharing economy  there are many possible reasons for renting instead of buying for example  shortterm rental of all sorts of products excluding real estate and holiday apartments already represents an estimated  billion  billion annual market in europe and is expected to grow further as the internet makes it easier to find specific items available for rent  according to a poll by yougov  of people looking to rent would go to the internet first to find what they need rising to  for those aged     it has been widely reported that the financial crisis of  may have contributed to the rapid growth of online rental marketplaces suc

## Testing other links

In [8]:
print('Finance: ')
Finance_frame = scrapArticle("https://en.wikipedia.org/wiki/Category:Finance")
print('Mathematics: ')
Mathematics_frame = scrapArticle("https://en.wikipedia.org/wiki/Category:Mathematics")
print('Sports: ')
Sports_frame = scrapArticle("https://en.wikipedia.org/wiki/Category:Sports")

Finance: 
https://en.wikipedia.org/wiki/Finance
https://en.wikipedia.org/wiki/Designated_Professional_Body
https://en.wikipedia.org/wiki/Designated_Professional_Body
https://en.wikipedia.org/wiki/Request_for_quote
https://en.wikipedia.org/wiki/Real_bills_doctrine
https://en.wikipedia.org/wiki/Asset
https://en.wikipedia.org/wiki/P2F
https://en.wikipedia.org/wiki/Brattle_Group
https://en.wikipedia.org/wiki/Shadow_Banking_in_China
https://en.wikipedia.org/wiki/JEL_classification_codes
https://en.wikipedia.org/wiki/Finance
https://en.wikipedia.org/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?
https://en.wikipedia.org/wiki/Help:Categories
https://en.wikipedia.org/wiki/JEL_classification_codes
https://en.wikipedia.org/wiki/Approved_Publication_Arrangement
https://en.wikipedia.org/wiki/Trade_exchange
https://en.wikipedia.org/wiki/Trade_exchange
https://en.wikipedia.org/wiki/Asset
https://en.wikipedia.org/wiki/Master_of_Applied_Finance
https://en.wikipedia.org/

In [9]:
Mathematics_articles = pd.DataFrame(Mathematics_frame, columns=['title', 'link', 'article content'])
Mathematics_articles

| | title | link | article content |
|---|---|---|---|
| 0 | Quota rule | https://en.wikipedia.org/wiki/Quota_rule | in mathematics and political science the quota... |
| 1 | Composite methods for structural dynamics | https://en.wikipedia.org/wiki/Composite_method... | composite methods are an approach applied in s... |
| 2 | space | https://en.wikipedia.org/wiki/Space | space is the boundless threedimensional ex... |
| 3 | Colon | https://en.wikipedia.org/wiki/Colon_classifica... | colon classification cc is a system of library... |
| 4 | Archives of American Mathematics | https://en.wikipedia.org/wiki/Archives_of_Amer... | the archives of american mathematics located a... |
| 5 | Peano kernel theorem | https://en.wikipedia.org/wiki/Peano_kernel_the... | in numerical analysis the peano kernel theorem... |
| 6 | Composite methods for structural dynamics | https://en.wikipedia.org/wiki/Composite_method... | composite methods are an approach applied in s... |
| 7 | Pseudorandom graph | https://en.wikipedia.org/wiki/Pseudorandom_graph | in graph theory a graph is said to be a pseudo... |
| 8 | Mathematics in Nepal | https://en.wikipedia.org/wiki/Mathematics_in_N... | mathematics has been used in nepal for measure... |
| 9 | space | https://en.wikipedia.org/wiki/Space | space is the boundless threedimensional ex... |


In [10]:
Sports_articles = pd.DataFrame(Sports_frame, columns=['title', 'link', 'article content'])
Sports_articles

| | title | link | article content |
|---|---|---|---|
| 0 | Categories | https://en.wikipedia.org/wiki/Help:Category | categories are intended to group together pa... |
| 1 | Height in sports | https://en.wikipedia.org/wiki/Height_in_sports | height can significantly influence success in ... |
| 2 | Categories | https://en.wikipedia.org/wiki/Help:Category | categories are intended to group together pa... |
| 3 | Library of Congress | https://en.wikipedia.org/wiki/Library_of_Congr... | the library of congress classification lcc is... |
| 4 | Sports portal | https://en.wikipedia.org/wiki/Portal:Sports | sport includes all forms of competitive physic... |
| 5 | List of sports | https://en.wikipedia.org/wiki/List_of_sports | the following is a list of sportsgames divided... |
| 6 | Portal:Sports | https://en.wikipedia.org/wiki/Portal:Sports | sport includes all forms of competitive physic... |
| 7 | Pacha Nobin Jomoh | https://en.wikipedia.org/wiki/Pacha_Nobin_Jomoh | pacha nobin jomoh born st november is an ind... |
| 8 | Sport | https://en.wikipedia.org/wiki/Sport | sport includes all forms of competitive ph... |
| 9 | sport | https://en.wikipedia.org/wiki/Sport | sport includes all forms of competitive ph... |


***
### Links
- E-mail: [zakaria.abbou199434@gmail.com](mailto:zakaria.abbou199434@gmail.com)
- GitHub: [github.com/ZakariaAABBOU](https://github.com/ZakariaAABBOU)
- LinkedIn: [linkedin.com/in/zakaria-aabbou/](https://www.linkedin.com/in/zakaria-aabbou/)