# Proyecto web (Semana 3)

Para este proyecto nos fue solicitado que realizaramos dos tareas:
1. Hacer uso de un API para generar un dataset.
2. Aplicar web scraping para generar un dataset.

Estas dos tareas deben resultar en los siguientes archivos:
1. Un archivo ".csv" en el cual tengamos el dataset generado via API.
2. Un archivo ".csv" en el cual tengamos el dataset generado via API, aplicando labores de limpieza y manipulación.
3. Un archivo ".csv" en el cual tengamos el dataset generado via web scraping.
4. Un archivo ".csv" en el cual tengamos el dataset generado via web scraping, aplicando labores de limpieza y manipulación.

# Ideas para el proyecto.

Considerando que los datos son la materia prima para proyectos de analitica, decidí utilizar el API de un gran sitio (Kaggle) que contiene datasets sobre diferentes temas, la gran mayoria de manera pública.

En el caso del web scraping decidí tomar una página sobre inversión en diferentes mercados, derivados y activos (Investing.com), tema que es de mi interes y por el momento el sitio no posee un API.

# API 

Lo primero que realice para el uso del API, fue instalar un wrapper que ofrece el API de Kaggle.

In [1]:
#import sys
#!{sys.executable} -m pip install mdutils kaggle 

Continue con una celda para realizar todos los imports que se vayan requiriendo a lo largo del proyecto.

In [63]:
import requests
from kaggle.api.kaggle_api_extended import KaggleApi
import time
import pandas as pd
import json
import operator
from bs4 import BeautifulSoup
import re

El wrapper del API de Kaggle realiza la autentificación con los siguientes comandos.

In [42]:
api = KaggleApi({"username":"","key":""})
api.authenticate()

En este momento estamos autorizados para utilizar el API a partir de todos los métodos que provee el wrapper. En teoría el API de kaggle es más fácil de usar desde un shell, y su documentación (https://github.com/Kaggle/kaggle-api) esta redactada para su uso en shell. Pero es completamente factible traducir todos sus comandos al metodo incluido en el wrapper. Algunos de los comandos disponibles se enlistan a continuación:

| Sección:                       | Competitions                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                                                                                                                                                  |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Comando                        | Parametros                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Descripción                                                                                                                                                                      |
| dataset_list()                 | sort_by: how to sort the result, see valid_dataset_sort_bys for options size: the size of the dataset, see valid_dataset_sizes for string options file_type: the format, see valid_dataset_file_types for string options license_name: string descriptor for license, see valid_dataset_license_names tag_ids: tag identifiers to filter the search search: a search term to use (default is empty string) user: username to filter the search to mine: boolean if True, group is changed to "my" to return personal page: the page to return (default is 1) | Comando para realizar búsqueda de datasets, los parámetros extra permiten ordenarlos, filtrar por tags, página que obtener, buscar datasets por usuario y otras caracteristicas. |
| dataset_view()                 | :param str owner_slug: Dataset owner (required) :param str dataset_slug: Dataset name (required)                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Ver metadatos de un dataset.                                                                                                                                                     |
| dataset_metadata(dataset,path) | dataset: name dataset path: its obtain with the name of the dataset.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Ver metadatos de un dataset.                                                                                                                                                     |
| dataset_list_files(dataset)    | dataset: the string identified of the dataset should be in format [owner]/[dataset-name]                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | Lista los archivos presentes en el dataset.                                                                                                                                      |
| dataset_download_file()        | dataset: the string identified of the dataset should be in format [owner]/[dataset-name] file_name: the dataset configuration file path: if defined, download to this location force: force the download if the file already exists (default False) quiet: suppress verbose output (default is True)                                                                                                                                                                                                                                                         | Descarga un archivo presente  en un dataset.                                                                                                                                     |
| dataset_download_files()       | dataset: the string identified of the dataset should be in format [owner]/[dataset-name] path: the path to download the dataset to force: force the download if the file already exists (default False) quiet: suppress verbose output (default is True) unzip: if True, unzip files upon download (default is False)                                                                                                                                                                                                                                        | Descacarga todos los archivos  de un dataset.                                                                                                                                    |
| download_file()                | response: the response to download outfile: the output file to download to quiet: suppress verbose output (default is True) chunk_size: the size of the chunk to stream                                                                                                                                                                                                                                                                                                                                                                                      | También descarga un archivo.                                                                                                                                                     |

## Extraer data.

Para mi proyecto me interesa obtener datasets que tengan relación palabras clave que yo seleccione, para esto construyo una lista con dichas palabras, en ella preferentemente hay que agregar palabras en inglés.

In [44]:
intereses = ['currencies','currency','forex','finance','exchanges','tweets','news','fake news']

Continue realizando una búsqueda en la API con cada interes, decidí agregar una pausa entre cada solicitud a la API de un 1.5s.

In [None]:
datasets_category = pd.DataFrame()
result_busqueda_list = []
categoria_list = []

for interes in intereses:
    time.sleep(1.5)
    response = api.dataset_list(search=interes)
    if len(response) != 0:
        result_busqueda_list.extend(response)
        categoria_list.extend(((interes+',')*len(response)).split(',')[:-1])
    
datasets_category['Dataset'] = result_busqueda_list
datasets_category['Category'] = categoria_list

In [84]:
print('Se obtuvieron %i datasets en la busqueda sobre los interes seleccionados.' % len(datasets_category))

Se obtuvieron 152 datasets en la busqueda sobre los interes seleccionados.


Revisamos si existen repeticiones en los resultados de busqueda.

In [85]:
if len(set(datasets_category['Dataset'])) != len(datasets_category):
    print('Existen datasets repetidos')
else:
    print('No hay datasets repetidos')

No hay datasets repetidos


Para cada uno de los datasets encontrados en la búsqueda descargaremos sus metadatos.

In [161]:
metadata_datasets_list = []
for dataset in datasets_category['Dataset']:
    time.sleep(1)
    owner_name = str(dataset).split('/')[0]
    name = str(dataset).split('/')[1]
    metadata_datasets_list.append(api.datasets_view(owner_name,name))

In [158]:
metadata_keys = ['id', 'ref', 'subtitle', 'tags', 'creatorName', 'creatorUrl',
                 'totalBytes', 'url', 'lastUpdated', 'downloadCount', 'isPrivate',
                 'isReviewed', 'isFeatured', 'licenseName', 'description', 'ownerName',
                 'ownerRef', 'kernelCount', 'title', 'topicCount', 'viewCount', 'voteCount',
                 'currentVersionNumber', 'files', 'versions', 'usabilityRating']

In [196]:
metadata_df = pd.DataFrame(metadata_datasets_list)
metadata_df

Unnamed: 0,creatorName,creatorUrl,currentVersionNumber,description,downloadCount,files,id,isFeatured,isPrivate,isReviewed,...,subtitle,tags,title,topicCount,totalBytes,url,usabilityRating,versions,viewCount,voteCount
0,jvent,,17,# Cryptocurrency Market Data\n## Historical Cr...,9028,"[{'ref': 'crypto-markets.csv', 'creationDate':...",1963,False,False,True,...,"Daily crypto markets open, close, low, high da...","[{'ref': 'finance', 'competitionCount': 4, 'da...",Every Cryptocurrency Daily Market Price,29,23636187,https://www.kaggle.com/jessevent/all-crypto-cu...,0.852941,"[{'versionNumber': 17, 'creationDate': '2018-1...",76165,378
1,Albert Costas,,8,«Datasets per la comparació de moviments i pat...,565,"[{'ref': '1_11_2017_crypto_currencies.csv', 'c...",2963,False,False,False,...,Cryptocurrency Market Capitalizations,"[{'ref': 'finance', 'competitionCount': 4, 'da...",Crypto Currencies,0,1321667,https://www.kaggle.com/acostasg/crypto-currencies,0.705882,"[{'versionNumber': 8, 'creationDate': '2017-12...",7763,12
2,Albert Costas,,2,«Datasets per la comparació de moviments i pat...,515,"[{'ref': 'dataset.csv', 'creationDate': '2017-...",6902,False,False,True,...,Relation and patterns between movements of sto...,"[{'ref': 'economics', 'competitionCount': 0, '...",Analysis about crypto currencies and Stock Index,0,681413,https://www.kaggle.com/acostasg/cryptocurrenci...,0.705882,"[{'versionNumber': 2, 'creationDate': '2017-12...",5817,18
3,mitillo,,3,### Context\n\nThis is a different timeframe c...,134,"[{'ref': 'currencies.rar', 'creationDate': '20...",2661,False,False,False,...,,[],Currencies,0,1151577,https://www.kaggle.com/mitillo/currencies,0.411765,"[{'versionNumber': 3, 'creationDate': '2017-09...",1231,2
4,Sebastian,,2,This dataset contains the daily currency excha...,516,[{'ref': 'currency_exchange_rates_02-01-1995_-...,20872,False,False,False,...,Daily exchange rates for 51 currencies from 19...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",Currency Exchange Rates,1,596854,https://www.kaggle.com/thebasss/currency-excha...,0.647059,"[{'versionNumber': 2, 'creationDate': '2018-05...",2591,21
5,Albert Costas,,1,«Datasets per la comparació de moviments i pat...,160,"[{'ref': '1_11_2017_crypto_currencies.csv', 'c...",4161,False,False,False,...,Cryptocurrency Market Capitalizations,"[{'ref': 'finance', 'competitionCount': 4, 'da...",Crypto Currencies,0,1082628,https://www.kaggle.com/acostasg/crypto-currenc...,0.647059,"[{'versionNumber': 1, 'creationDate': '2017-11...",1484,1
6,Pablo Lopez Santori,,1,### Context\n\nI put together this dataset whe...,69,"[{'ref': 'cat_to_name.json', 'creationDate': '...",150253,False,False,False,...,A collection of coin images from 32 different ...,"[{'ref': 'image data', 'competitionCount': 63,...",World Coins,0,480602984,https://www.kaggle.com/wanderdust/coin-images,0.937500,"[{'versionNumber': 1, 'creationDate': '2019-03...",408,5
7,Luigi,,1,### Content\n\nover 10 years of historical exc...,102,"[{'ref': 'exchange.csv', 'creationDate': '2017...",1407,False,False,False,...,historical data monthly frequencies 01/07/1997...,"[{'ref': 'economics', 'competitionCount': 0, '...",Exchange rate BRIC currencies/US dollar,0,3657,https://www.kaggle.com/luigimersico/exchange-r...,0.529412,"[{'versionNumber': 1, 'creationDate': '2017-06...",1019,3
8,SRK,,13,"### Context\n\nThings like Block chain, Bitcoi...",19238,"[{'ref': 'bitcoin_cash_price.csv', 'creationDa...",1869,False,False,False,...,Prices of top cryptocurrencies including Bitco...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",Cryptocurrency Historical Prices,12,715347,https://www.kaggle.com/sudalairajkumar/cryptoc...,0.705882,"[{'versionNumber': 13, 'creationDate': '2018-0...",140217,343
9,Ulas Can Cengiz,,1,### Context\n\nHere's one of the largest Crypt...,101,"[{'ref': 'cc_histories.zip', 'creationDate': '...",30652,False,False,False,...,Historical Coin Prices to Understand the Big P...,"[{'ref': 'economics', 'competitionCount': 0, '...",Price History of 1654 Crypto-Currencies,0,19131516,https://www.kaggle.com/ulascengiz/price-histor...,0.687500,"[{'versionNumber': 1, 'creationDate': '2018-06...",810,4


## Selección de columnas

Seleccioné las columnas que considero útiles.

In [238]:
useful_key_metadata = ['title','subtitle','description','lastUpdated','ref','totalBytes','url','tags','downloadCount','licenseName',
                       'kernelCount','versions','usabilityRating'] 

In [239]:
dataset = metadata_df[useful_key_metadata]
dataset

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalBytes,url,tags,downloadCount,licenseName,kernelCount,versions,usabilityRating
0,Every Cryptocurrency Daily Market Price,"Daily crypto markets open, close, low, high da...",# Cryptocurrency Market Data\n## Historical Cr...,2018-12-01T13:56:58.277Z,jessevent/all-crypto-currencies,23636187,https://www.kaggle.com/jessevent/all-crypto-cu...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",9028,Other (specified in description),63,"[{'versionNumber': 17, 'creationDate': '2018-1...",0.852941
1,Crypto Currencies,Cryptocurrency Market Capitalizations,«Datasets per la comparació de moviments i pat...,2017-12-03T18:55:04.34Z,acostasg/crypto-currencies,1321667,https://www.kaggle.com/acostasg/crypto-currencies,"[{'ref': 'finance', 'competitionCount': 4, 'da...",565,"Database: Open Database, Contents: Database Co...",2,"[{'versionNumber': 8, 'creationDate': '2017-12...",0.705882
2,Analysis about crypto currencies and Stock Index,Relation and patterns between movements of sto...,«Datasets per la comparació de moviments i pat...,2017-12-13T22:38:33.32Z,acostasg/cryptocurrenciesvsstockindex,681413,https://www.kaggle.com/acostasg/cryptocurrenci...,"[{'ref': 'economics', 'competitionCount': 0, '...",515,"Database: Open Database, Contents: © Original ...",1,"[{'versionNumber': 2, 'creationDate': '2017-12...",0.705882
3,Currencies,,### Context\n\nThis is a different timeframe c...,2017-09-24T19:50:59.687Z,mitillo/currencies,1151577,https://www.kaggle.com/mitillo/currencies,[],134,Unknown,1,"[{'versionNumber': 3, 'creationDate': '2017-09...",0.411765
4,Currency Exchange Rates,Daily exchange rates for 51 currencies from 19...,This dataset contains the daily currency excha...,2018-05-02T17:48:28.943Z,thebasss/currency-exchange-rates,596854,https://www.kaggle.com/thebasss/currency-excha...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",516,CC0: Public Domain,0,"[{'versionNumber': 2, 'creationDate': '2018-05...",0.647059
5,Crypto Currencies,Cryptocurrency Market Capitalizations,«Datasets per la comparació de moviments i pat...,2017-11-07T20:19:07.32Z,acostasg/crypto-currencies-data,1082628,https://www.kaggle.com/acostasg/crypto-currenc...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",160,"Database: Open Database, Contents: Database Co...",0,"[{'versionNumber': 1, 'creationDate': '2017-11...",0.647059
6,World Coins,A collection of coin images from 32 different ...,### Context\n\nI put together this dataset whe...,2019-03-27T09:26:10.133Z,wanderdust/coin-images,480602984,https://www.kaggle.com/wanderdust/coin-images,"[{'ref': 'image data', 'competitionCount': 63,...",69,Other (specified in description),1,"[{'versionNumber': 1, 'creationDate': '2019-03...",0.937500
7,Exchange rate BRIC currencies/US dollar,historical data monthly frequencies 01/07/1997...,### Content\n\nover 10 years of historical exc...,2017-06-15T14:52:31.757Z,luigimersico/exchange-rate-bric-currenciesus-d...,3657,https://www.kaggle.com/luigimersico/exchange-r...,"[{'ref': 'economics', 'competitionCount': 0, '...",102,Unknown,2,"[{'versionNumber': 1, 'creationDate': '2017-06...",0.529412
8,Cryptocurrency Historical Prices,Prices of top cryptocurrencies including Bitco...,"### Context\n\nThings like Block chain, Bitcoi...",2018-02-21T12:36:47.22Z,sudalairajkumar/cryptocurrencypricehistory,715347,https://www.kaggle.com/sudalairajkumar/cryptoc...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",19238,CC0: Public Domain,39,"[{'versionNumber': 13, 'creationDate': '2018-0...",0.705882
9,Price History of 1654 Crypto-Currencies,Historical Coin Prices to Understand the Big P...,### Context\n\nHere's one of the largest Crypt...,2018-06-09T02:44:13.39Z,ulascengiz/price-history-of-1654-cryptocurrencies,19131516,https://www.kaggle.com/ulascengiz/price-histor...,"[{'ref': 'economics', 'competitionCount': 0, '...",101,Other (specified in description),0,"[{'versionNumber': 1, 'creationDate': '2018-06...",0.687500


## Identificando valores nulos.

Buscamos datos que no sean un valor.

In [240]:
missing_values = dataset.isna().sum()
missing_values

title              0
subtitle           0
description        0
lastUpdated        0
ref                0
totalBytes         0
url                0
tags               0
downloadCount      0
licenseName        0
kernelCount        0
versions           0
usabilityRating    0
dtype: int64

No se encontraron valores nulos.

## Manipulación del dataset

Agregamos la columna de categoría de búsqueda con la que iniciamos.

In [241]:
dataset['category'] = categoria_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


### Columna total bytes.

Modificamos el valor base de la columna de "Bytes" a "Mega bytes" 

In [242]:
byte_to_gb = lambda x: x/1000000
dataset["totalBytes"] = dataset["totalBytes"].apply(byte_to_gb)
dataset = dataset.rename(columns = {"totalBytes":"totalGigaBytes"})
dataset.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,tags,downloadCount,licenseName,kernelCount,versions,usabilityRating,category
0,Every Cryptocurrency Daily Market Price,"Daily crypto markets open, close, low, high da...",# Cryptocurrency Market Data\n## Historical Cr...,2018-12-01T13:56:58.277Z,jessevent/all-crypto-currencies,23.636187,https://www.kaggle.com/jessevent/all-crypto-cu...,"[{'ref': 'finance', 'competitionCount': 4, 'da...",9028,Other (specified in description),63,"[{'versionNumber': 17, 'creationDate': '2018-1...",0.852941,currencies
1,Crypto Currencies,Cryptocurrency Market Capitalizations,«Datasets per la comparació de moviments i pat...,2017-12-03T18:55:04.34Z,acostasg/crypto-currencies,1.321667,https://www.kaggle.com/acostasg/crypto-currencies,"[{'ref': 'finance', 'competitionCount': 4, 'da...",565,"Database: Open Database, Contents: Database Co...",2,"[{'versionNumber': 8, 'creationDate': '2017-12...",0.705882,currencies
2,Analysis about crypto currencies and Stock Index,Relation and patterns between movements of sto...,«Datasets per la comparació de moviments i pat...,2017-12-13T22:38:33.32Z,acostasg/cryptocurrenciesvsstockindex,0.681413,https://www.kaggle.com/acostasg/cryptocurrenci...,"[{'ref': 'economics', 'competitionCount': 0, '...",515,"Database: Open Database, Contents: © Original ...",1,"[{'versionNumber': 2, 'creationDate': '2017-12...",0.705882,currencies


## Columna Tags.

En la columna "tags" encontramos referencias a los grupos en los cuales se encuentra clasificado el dataset.
Esta columna varia de dataset a dataset. Sin embargo nos permite tener aún más grupos sobre los cuales realizar
busquedas con resultados que puedan ser de interes al usuario.

Para esta columna realizaremos una extracción de todas las etiquetas y obtenemos la frecuencia de un set para evitar repeticiones,además de que presentaremos las tres más frecuentes al usuario de manera que este pueda usarlas en una búsqueda de intereses aún mayor.

In [243]:
suggest_interest = [element['ref'] for tag in dataset['tags'] for element in tag]
set_suggest = set(suggest_interest)
dict_freq_suggest = {k:suggest_interest.count(k) for k in set_suggest}
sorted_tups = sorted(dict_freq_suggest.items(), key=operator.itemgetter(1))
print(sorted_tups[-10:])

[('nlp', 8), ('twitter', 11), ('news agencies', 13), ('politics', 15), ('business', 15), ('money', 15), ('economics', 17), ('linguistics', 18), ('internet', 35), ('finance', 56)]


Estas sugerencias se pueden interpretar como los hashtag que contiene el dataset, por tanto hay que ser cuidadosos al seleccionar nuevos intereses de la lista.

Así también tenemos valores vacios para la columna etiquetas, así que sustituiremos la columna tags
por "number of tags"

In [244]:
num_tags = lambda x: len(x)
dataset["tags"] = dataset["tags"].apply(num_tags)
dataset = dataset.rename(columns = {"tags":"numberOfTags"})
dataset.head(3)

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,numberOfTags,downloadCount,licenseName,kernelCount,versions,usabilityRating,category
0,Every Cryptocurrency Daily Market Price,"Daily crypto markets open, close, low, high da...",# Cryptocurrency Market Data\n## Historical Cr...,2018-12-01T13:56:58.277Z,jessevent/all-crypto-currencies,23.636187,https://www.kaggle.com/jessevent/all-crypto-cu...,3,9028,Other (specified in description),63,"[{'versionNumber': 17, 'creationDate': '2018-1...",0.852941,currencies
1,Crypto Currencies,Cryptocurrency Market Capitalizations,«Datasets per la comparació de moviments i pat...,2017-12-03T18:55:04.34Z,acostasg/crypto-currencies,1.321667,https://www.kaggle.com/acostasg/crypto-currencies,2,565,"Database: Open Database, Contents: Database Co...",2,"[{'versionNumber': 8, 'creationDate': '2017-12...",0.705882,currencies
2,Analysis about crypto currencies and Stock Index,Relation and patterns between movements of sto...,«Datasets per la comparació de moviments i pat...,2017-12-13T22:38:33.32Z,acostasg/cryptocurrenciesvsstockindex,0.681413,https://www.kaggle.com/acostasg/cryptocurrenci...,3,515,"Database: Open Database, Contents: © Original ...",1,"[{'versionNumber': 2, 'creationDate': '2017-12...",0.705882,currencies


## Columna Versions

La columna versions contiene al menos una versión para el dataset, sin embargo en caso de que tenga más solo sería
de nuestro interes la última versión y su fecha.

In [245]:
dataset["versions"][0][0]

{'versionNumber': 17,
 'creationDate': '2018-12-01T13:56:58.277Z',
 'creatorName': 'jvent',
 'creatorRef': 'all-crypto-currencies',
 'versionNotes': 'Updated data as of 30/11/2018',
 'status': 'Ready'}

In [246]:
last_version = lambda x: str(x[0]['status']) + ' version: ' +  str(x[0]['versionNumber']) + ', ' + str(x[0]['creationDate'])
dataset["versions"] = dataset["versions"].apply(last_version)
dataset = dataset.rename(columns = {"versions":"lastVersion"})
dataset.head(3)

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,numberOfTags,downloadCount,licenseName,kernelCount,lastVersion,usabilityRating,category
0,Every Cryptocurrency Daily Market Price,"Daily crypto markets open, close, low, high da...",# Cryptocurrency Market Data\n## Historical Cr...,2018-12-01T13:56:58.277Z,jessevent/all-crypto-currencies,23.636187,https://www.kaggle.com/jessevent/all-crypto-cu...,3,9028,Other (specified in description),63,"Ready version: 17, 2018-12-01T13:56:58.277Z",0.852941,currencies
1,Crypto Currencies,Cryptocurrency Market Capitalizations,«Datasets per la comparació de moviments i pat...,2017-12-03T18:55:04.34Z,acostasg/crypto-currencies,1.321667,https://www.kaggle.com/acostasg/crypto-currencies,2,565,"Database: Open Database, Contents: Database Co...",2,"Ready version: 8, 2017-12-03T18:55:04.34Z",0.705882,currencies
2,Analysis about crypto currencies and Stock Index,Relation and patterns between movements of sto...,«Datasets per la comparació de moviments i pat...,2017-12-13T22:38:33.32Z,acostasg/cryptocurrenciesvsstockindex,0.681413,https://www.kaggle.com/acostasg/cryptocurrenci...,3,515,"Database: Open Database, Contents: © Original ...",1,"Ready version: 2, 2017-12-13T22:38:33.32Z",0.705882,currencies


## Análisis.

Finalmente por ahora, podemos utilizar cada una de las columnas disponibles para realizar algunos filtros con los cuales obtener información interesante.

### Mejor score de utilidad.

In [247]:
dataset.sort_values(['usabilityRating'],ascending=False)

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,numberOfTags,downloadCount,licenseName,kernelCount,lastVersion,usabilityRating,category
95,Australian Election 2019 Tweets,"May 18th 2019, 180k+ tweets",### Context\n\nDuring the 2019 Australian elec...,2019-05-21T09:41:38.763Z,taniaj/australian-election-2019-tweets,29.972572,https://www.kaggle.com/taniaj/australian-elect...,5,2366,CC0: Public Domain,8,"Ready version: 2, 2019-05-21T09:41:38.763Z",1.000000,tweets
80,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,### Context \nBitcoin is the longest running a...,2019-03-15T16:22:58.397Z,mczielinski/bitcoin-historical-data,123.326534,https://www.kaggle.com/mczielinski/bitcoin-his...,2,43214,CC BY-SA 4.0,128,"Ready version: 16, 2019-03-15T16:22:58.397Z",1.000000,exchanges
28,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,### Context \nBitcoin is the longest running a...,2019-03-15T16:22:58.397Z,mczielinski/bitcoin-historical-data,123.326534,https://www.kaggle.com/mczielinski/bitcoin-his...,2,43214,CC BY-SA 4.0,128,"Ready version: 16, 2019-03-15T16:22:58.397Z",1.000000,currency
136,News Headlines Dataset For Sarcasm Detection,High quality dataset for the task of Sarcasm D...,#Context\n\nPast studies in Sarcasm Detection ...,2019-07-03T23:52:57.127Z,rmisra/news-headlines-dataset-for-sarcasm-dete...,3.425749,https://www.kaggle.com/rmisra/news-headlines-d...,4,5694,CC0: Public Domain,48,"Ready version: 2, 2019-07-03T23:52:57.127Z",1.000000,fake news
64,Mutual Funds and ETFs,25k+ Mutual Funds and 2k+ ETFs scraped from Ya...,### Context\n\nETFs represent a cheap alternat...,2019-05-04T02:00:37.827Z,stefanoleone992/mutual-funds-and-etfs,4.547400,https://www.kaggle.com/stefanoleone992/mutual-...,5,597,CC0: Public Domain,2,"Ready version: 3, 2019-05-04T02:00:37.827Z",1.000000,finance
130,Yet Another Chinese News Dataset,"With Article Titles, Descriptions, Cover Image...",A collections of news articles in Traditional ...,2019-07-11T17:01:17.377Z,ceshine/yet-another-chinese-news-dataset,24.983959,https://www.kaggle.com/ceshine/yet-another-chi...,3,137,CC BY-SA 4.0,4,"Ready version: 7, 2019-07-11T17:01:17.377Z",1.000000,news
26,401 crypto currency pairs at 1-minute resolution,Historical crypto currency data from the Bitfi...,## About this dataset\n\nWith the rise of cryp...,2019-07-09T21:25:22.227Z,tencars/392-crypto-currency-pairs-at-minute-re...,393.638972,https://www.kaggle.com/tencars/392-crypto-curr...,3,94,CC BY-SA 4.0,2,"Ready version: 2, 2019-07-09T21:25:22.227Z",1.000000,currency
113,News Category Dataset,Identify the type of news based on headlines a...,# Context\nThis dataset contains around 200k n...,2018-12-02T04:09:45.777Z,rmisra/news-category-dataset,26.337702,https://www.kaggle.com/rmisra/news-category-da...,4,5165,CC0: Public Domain,22,"Ready version: 2, 2018-12-02T04:09:45.777Z",1.000000,news
74,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,### Context \nBitcoin is the longest running a...,2019-03-15T16:22:58.397Z,mczielinski/bitcoin-historical-data,123.326534,https://www.kaggle.com/mczielinski/bitcoin-his...,2,43214,CC BY-SA 4.0,128,"Ready version: 16, 2019-03-15T16:22:58.397Z",1.000000,finance
120,News Headlines Dataset For Sarcasm Detection,High quality dataset for the task of Sarcasm D...,#Context\n\nPast studies in Sarcasm Detection ...,2019-07-03T23:52:57.127Z,rmisra/news-headlines-dataset-for-sarcasm-dete...,3.425749,https://www.kaggle.com/rmisra/news-headlines-d...,4,5694,CC0: Public Domain,48,"Ready version: 2, 2019-07-03T23:52:57.127Z",1.000000,news


### Mayor tamaño

In [249]:
dataset.sort_values(['totalGigaBytes'],ascending=False)

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,numberOfTags,downloadCount,licenseName,kernelCount,lastVersion,usabilityRating,category
16,Ethereum Blockchain,Complete live historical Ethereum blockchain d...,## Context\n\nBitcoin and other cryptocurrenci...,2019-03-04T14:57:55.953Z,bigquery/ethereum-blockchain,910127.001043,https://www.kaggle.com/bigquery/ethereum-block...,5,0,CC0: Public Domain,20,"Ready version: 4, 2019-03-04T14:57:55.953Z",0.705882,currencies
112,Hacker News,All posts from Y Combinator's social news webs...,### Context\n\nThis dataset contains all stori...,2019-02-12T00:34:51.853Z,hacker-news/hacker-news,15883.923392,https://www.kaggle.com/hacker-news/hacker-news,4,0,CC0: Public Domain,1495,"Ready version: 2, 2019-02-12T00:34:51.853Z",0.705882,news
48,Forex RSI and BBPP multiperiod (m1-h4),,,2018-11-11T11:52:30.603Z,yurisa2/forex-rsi-and-bbpp-multiperiod-m1h4,5565.310574,https://www.kaggle.com/yurisa2/forex-rsi-and-b...,0,93,Unknown,2,"Ready version: 2, 2018-11-11T11:52:30.603Z",0.176471,forex
131,Old Newspapers,A cleaned subset of HC Corpora newspapers,### Context\n\nThe [HC Corpora](https://web.ar...,2017-11-16T04:53:55.98Z,alvations/old-newspapers,2196.786581,https://www.kaggle.com/alvations/old-newspapers,4,847,CC0: Public Domain,3,"Ready version: 6, 2017-11-16T04:53:55.98Z",0.750000,news
76,NYC Parking Tickets,"42.3M Rows of Parking Ticket Data, Aug 2013-Ju...",### Context\n\nThe NYC Department of Finance c...,2017-10-26T18:47:45.14Z,new-york-city/nyc-parking-tickets,2171.622562,https://www.kaggle.com/new-york-city/nyc-parki...,4,8074,CC0: Public Domain,6,"Ready version: 2, 2017-10-26T18:47:45.14Z",0.823529,finance
25,Iraqi Money العملة العراقية,Object detection dataset for Iraqi currency,### Object detection dataset for Iraqi currenc...,2018-08-23T09:28:29.143Z,husamaamer/iraqi-currency-,1435.021165,https://www.kaggle.com/husamaamer/iraqi-currency-,4,40,Unknown,2,"Ready version: 2, 2018-08-23T09:28:29.143Z",0.687500,currency
33,Nepali Currency,,,2018-10-31T18:15:52.017Z,thevirusx3/nepali-currency,1073.942714,https://www.kaggle.com/thevirusx3/nepali-currency,0,12,Unknown,1,"Ready version: 4, 2018-10-31T18:15:52.017Z",0.125000,currency
44,EURUSD jan/2014 - oct/2018,"Forex with a ton of indicators, MQL5 retrieved...","Forex with a ton of indicators, MQL5 retrieved...",2018-10-04T01:37:53Z,yurisa2/eurusd-2014-2018,1017.438780,https://www.kaggle.com/yurisa2/eurusd-2014-2018,4,74,CC0: Public Domain,2,"Ready version: 3, 2018-10-04T01:37:53Z",0.647059,forex
24,Binance Crypto Klines,"Minutely crypto currency open/close prices, hi...",### Context\n\nEach file contains klines for 1...,2018-04-08T09:58:41.477Z,binance/binance-crypto-klines,1004.510014,https://www.kaggle.com/binance/binance-crypto-...,5,486,CC0: Public Domain,1,"Ready version: 5, 2018-04-08T09:58:41.477Z",0.750000,currency
68,Lending Club Loan Data,Analyze Lending Club's issued loans,These files contain complete loan data for all...,2019-03-18T18:43:12.857Z,wendykan/lending-club-loan-data,736.483000,https://www.kaggle.com/wendykan/lending-club-l...,1,53334,Unknown,584,"Ready version: 1, 2019-03-18T18:43:12.857Z",0.735294,finance


### El más utilizado.

In [250]:
dataset.sort_values(['kernelCount'],ascending=False)

Unnamed: 0,title,subtitle,description,lastUpdated,ref,totalGigaBytes,url,numberOfTags,downloadCount,licenseName,kernelCount,lastVersion,usabilityRating,category
77,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Context\n---------\n\nIt is important that cre...,2018-03-23T01:17:27.913Z,mlg-ulb/creditcardfraud,69.155632,https://www.kaggle.com/mlg-ulb/creditcardfraud,3,136202,"Database: Open Database, Contents: Database Co...",2132,"Ready version: 3, 2018-03-23T01:17:27.913Z",0.852941,finance
112,Hacker News,All posts from Y Combinator's social news webs...,### Context\n\nThis dataset contains all stori...,2019-02-12T00:34:51.853Z,hacker-news/hacker-news,15883.923392,https://www.kaggle.com/hacker-news/hacker-news,4,0,CC0: Public Domain,1495,"Ready version: 2, 2019-02-12T00:34:51.853Z",0.705882,news
68,Lending Club Loan Data,Analyze Lending Club's issued loans,These files contain complete loan data for all...,2019-03-18T18:43:12.857Z,wendykan/lending-club-loan-data,736.483000,https://www.kaggle.com/wendykan/lending-club-l...,1,53334,Unknown,584,"Ready version: 1, 2019-03-18T18:43:12.857Z",0.735294,finance
78,Daily News for Stock Market Prediction,Using 8 years daily news headlines to predict ...,"Actually, I prepare this dataset for students ...",2016-08-25T16:56:51.32Z,aaron7sun/stocknews,6.384909,https://www.kaggle.com/aaron7sun/stocknews,2,23346,CC BY-NC-SA 4.0,306,"Ready version: 1, 2016-08-25T16:56:51.32Z",0.882353,finance
118,Daily News for Stock Market Prediction,Using 8 years daily news headlines to predict ...,"Actually, I prepare this dataset for students ...",2016-08-25T16:56:51.32Z,aaron7sun/stocknews,6.384909,https://www.kaggle.com/aaron7sun/stocknews,2,23346,CC BY-NC-SA 4.0,306,"Ready version: 1, 2016-08-25T16:56:51.32Z",0.882353,news
36,Kaggle Machine Learning & Data Science Survey ...,A big picture view of the state of data scienc...,"### Context\n\nFor the first time, Kaggle cond...",2017-10-27T22:03:03.417Z,kaggle/kaggle-survey-2017,3.692041,https://www.kaggle.com/kaggle/kaggle-survey-2017,3,16028,"Database: Open Database, Contents: © Original ...",296,"Ready version: 4, 2017-10-27T22:03:03.417Z",0.823529,currency
72,New York Stock Exchange,S&P 500 companies historical prices with funda...,# Context \n\nThis dataset is a playground for...,2017-02-22T10:18:25.517Z,dgawlik/nyse,34.402357,https://www.kaggle.com/dgawlik/nyse,1,29443,CC0: Public Domain,271,"Ready version: 3, 2017-02-22T10:18:25.517Z",0.852941,finance
20,Demonetization in India Twitter Data,Data extracted from Twitter regarding the rece...,# Context\n\nThe **demonetization of ₹500 and ...,2017-04-21T17:35:02.253Z,arathee2/demonetization-in-india-twitter-data,0.990156,https://www.kaggle.com/arathee2/demonetization...,4,4779,Unknown,171,"Ready version: 3, 2017-04-21T17:35:02.253Z",0.735294,currency
28,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,### Context \nBitcoin is the longest running a...,2019-03-15T16:22:58.397Z,mczielinski/bitcoin-historical-data,123.326534,https://www.kaggle.com/mczielinski/bitcoin-his...,2,43214,CC BY-SA 4.0,128,"Ready version: 16, 2019-03-15T16:22:58.397Z",1.000000,currency
74,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,### Context \nBitcoin is the longest running a...,2019-03-15T16:22:58.397Z,mczielinski/bitcoin-historical-data,123.326534,https://www.kaggle.com/mczielinski/bitcoin-his...,2,43214,CC BY-SA 4.0,128,"Ready version: 16, 2019-03-15T16:22:58.397Z",1.000000,finance


# Web scraping.

Algo que siempre me ha gustado son los modelos 3D, estos pueden ser útiles para empresas de videojuegos, para la industria de la animación y algunos otros sectores.

La primera página que encontre con cientos de modelos de pago y gratuitos para descargar fue: https://www.turbosquid.com/

Mi objetivo será hacer web scraping a su sitio y obtener información útil para cada uno de los modelos 3D diponibles.

Generé la clase que utilizaré para realizar el scraping tomando como base el trabajo en el lab de web scraping avanzado.

In [33]:
class WebSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: the regex pattern of the web urls to scape
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.output = []
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        if str(response) != '<Response [200]>':
            print('Error en la respuesta del servidor',response)
        elif str(response) == '<Response [408]>':
            print('Error en el limite de tiempo de respuesta del servidor')
        elif str(response) == '<Response [429]>':
            print('Error demasiadas peticiones')
        # I didn't find the SSL error but I add a 404 error catching.
        elif str(response) == '<Response [404]>':
            print('No se encontro el contenido')
        else:
            result = self.content_parser(response.content)
            self.output_results(result)
    
    """
    Export the scraped content. Right now it simply print out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        self.output.append(r)
        print('Se agregaron %i url a la lista' % (len(r)))
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)
                self.scrape_url(self.url_pattern % i)
            else:
                self.scrape_url(self.url_pattern % i)
        return self.output

La estructura del sítio resulto ser no tan mala, sin duda una más que utiliza los div de manera impulsiva pero encontramos el url del modelo en los "div" de clase "thumbnail thumbnail-md".

La estructura para el paginado es la siguiente:

"https://www.turbosquid.com/Search/3D-Models?page_num=2&sort_column=a5&sort_order=asc"

Ya que he utilizado sus herramientas de filtrado para ordenar de menor a mayor costo, y tener los modelos gratuitos al inicio.


In [6]:
# Hacemos una prueba para la página 0
response = requests.get('https://www.turbosquid.com/Search/3D-Models?sort_column=a5&sort_order=asc')
print(response)
content = response.content
soup_prueba = BeautifulSoup(content,'html')
divs_modelos = soup_prueba.find_all('div',{'class':'thumbnail thumbnail-md'})
divs_modelos

<Response [200]>


[<div assetname="Fan" class="thumbnail thumbnail-md" id="Asset-98" thumbcount="5">
 <table cellpadding="0" cellspacing="0">
 <tr>
 <td><a href="https://www.turbosquid.com/3d-models/3d-fan-1427865"><img alt="3D fan" class="" large_url="https://static.turbosquid.com/Preview/2019/07/19__07_54_52/ventilo_retouche.jpgFD6E519B-D20F-4B19-940D-AEBEDA505B52DefaultHQ.jpg" src="https://static.turbosquid.com/Preview/2019/07/19__07_54_52/ventilo_retouche.jpgFD6E519B-D20F-4B19-940D-AEBEDA505B52Large.jpg"/></a></td>
 </tr>
 </table>
 </div>,
 <div assetname="Pouf Stool" class="thumbnail thumbnail-md" id="Asset-97" thumbcount="5">
 <table cellpadding="0" cellspacing="0">
 <tr>
 <td><a href="https://www.turbosquid.com/3d-models/stool-3d-model-1427816"><img alt="stool 3D model" class="" large_url="https://static.turbosquid.com/Preview/2019/07/19__05_50_22/render_00.pngC01A4CF0-39E4-4076-B79F-D02BD112A72BDefaultHQ.jpg" src="https://static.turbosquid.com/Preview/2019/07/19__05_50_22/render_00.pngC01A4CF0-

In [14]:
# Para obtener el url del modelo necesitamos llegar a:
divs_modelos[0].select('a')[0]['href']

'https://www.turbosquid.com/3d-models/3d-fan-1427865'

Aplicando esto a toda la página de prueba:

In [15]:
urls_modelos = [div.select('a')[0]['href'] for div in divs_modelos]
urls_modelos

['https://www.turbosquid.com/3d-models/3d-fan-1427865',
 'https://www.turbosquid.com/3d-models/stool-3d-model-1427816',
 'https://www.turbosquid.com/3d-models/indoors-test-3d-model-1427813',
 'https://www.turbosquid.com/3d-models/3d-model-apartment-floor-1427811',
 'https://www.turbosquid.com/3d-models/3d-ceramic-coffee-cup-1427808',
 'https://www.turbosquid.com/3d-models/man-head-3d-model-1427729',
 'https://www.turbosquid.com/3d-models/blue-car-model-1427723',
 'https://www.turbosquid.com/3d-models/architecture-test-3d-model-1427708',
 'https://www.turbosquid.com/3d-models/es-studio-3d-model-1427682',
 'https://www.turbosquid.com/3d-models/3d-lighting-1427424',
 'https://www.turbosquid.com/3d-models/office-chair-3d-1427363',
 'https://www.turbosquid.com/3d-models/3d-model-urn-marble-concrete-1427284',
 'https://www.turbosquid.com/3d-models/3d-model-unwrapped-1426378',
 'https://www.turbosquid.com/3d-models/3d-metal-containers-1427125',
 'https://www.turbosquid.com/3d-models/3d-coffee

In [16]:
print('Tenemos %i modelos por página' % len(urls_modelos))

Tenemos 100 modelos por página


In [18]:
print('Dentro de la página hay %i modelos' % (100*7668) )

Dentro de la página hay 766800 modelos


Por lo tanto la función para el scraping de las url de los modelos queda de la siguiente forma:

In [27]:
def web_parser(content):
    soup_prueba = BeautifulSoup(content,'html')
    divs_modelos = soup_prueba.find_all('div',{'class':'thumbnail thumbnail-md'})
    urls_modelos = [div.select('a')[0]['href'] for div in divs_modelos]
    return urls_modelos

Un siguiente paso es obtener información sobre cada uno de los modelos entrando a la url obtenida y realizar scraping de nuevo, en este caso nos interesa obtener la siguiente información:

1. Nombre del modelo, div class productTitle
2. Dueño del modelo, div class productArtist
3. Precio del modelo, div class priceSection price
4. Licencia de uso, div class LicenseUses
5. Fecha de publicación
6. Formatos incluidos, tabla clase exchange
7. Categorias agregadas al modelo, accediendo al div class categorySection
8. Tags agragados al modelo, accediendo al div class tagSection
9. Descripción del modelo, accediendo al div class descriptionSection


In [34]:
# 3D models
# https://www.turbosquid.com/Search/3D-Models?sort_column=a5&sort_order=asc
URL_PATTERN = 'https://www.turbosquid.com/Search/3D-Models?page_num=%i&sort_column=a5&sort_order=asc' # regex pattern for the urls to scrape
PAGES_TO_SCRAPE = 2 # how many webpages to scrapge
SLEEP_INTERVAL = 1

# Instantiate the IronhackSpider class
project_spider = WebSpider(URL_PATTERN,PAGES_TO_SCRAPE,SLEEP_INTERVAL, content_parser = web_parser)

# Start scraping jobs
urls_modelos_pages = project_spider.kickstart()

Se agregaron 100 url a la lista
Se agregaron 100 url a la lista
2


[['https://www.turbosquid.com/3d-models/3d-fan-1427865',
  'https://www.turbosquid.com/3d-models/stool-3d-model-1427816',
  'https://www.turbosquid.com/3d-models/indoors-test-3d-model-1427813',
  'https://www.turbosquid.com/3d-models/3d-model-apartment-floor-1427811',
  'https://www.turbosquid.com/3d-models/3d-ceramic-coffee-cup-1427808',
  'https://www.turbosquid.com/3d-models/man-head-3d-model-1427729',
  'https://www.turbosquid.com/3d-models/blue-car-model-1427723',
  'https://www.turbosquid.com/3d-models/architecture-test-3d-model-1427708',
  'https://www.turbosquid.com/3d-models/es-studio-3d-model-1427682',
  'https://www.turbosquid.com/3d-models/3d-lighting-1427424',
  'https://www.turbosquid.com/3d-models/office-chair-3d-1427363',
  'https://www.turbosquid.com/3d-models/3d-model-urn-marble-concrete-1427284',
  'https://www.turbosquid.com/3d-models/3d-model-unwrapped-1426378',
  'https://www.turbosquid.com/3d-models/3d-metal-containers-1427125',
  'https://www.turbosquid.com/3d-m

Definimos una nueva clase de Spider.

In [110]:
class ModelsSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: the regex pattern of the web urls to scape
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, urls_pages, sleep_interval=-1, content_parser=None):
        self.urls_pages = urls_pages
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        self.output = []
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        if str(response) != '<Response [200]>':
            print('Error en la respuesta del servidor',response)
        elif str(response) == '<Response [408]>':
            print('Error en el limite de tiempo de respuesta del servidor')
        elif str(response) == '<Response [429]>':
            print('Error demasiadas peticiones')
        # I didn't find the SSL error but I add a 404 error catching.
        elif str(response) == '<Response [404]>':
            print('No se encontro el contenido')
        else:
            result = self.content_parser(response.content)
            self.output_results(result)
    
    """
    Export the scraped content. Right now it simply print out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        self.output.append(r)
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in self.urls_pages:
            for j in i:
                if self.sleep_interval > 0:
                    time.sleep(self.sleep_interval)
                    self.scrape_url(j)
                else:
                    self.scrape_url(j)
        return self.output

Definimos un nuevo parser para extraer los datos de cada página.

In [111]:
def model_parser(content):
    soup_prueba = BeautifulSoup(content,'html')
    nombre = modelos_prueba.find_all('div',{'class':'productTitle'})[0].select('h1')[0]['content']
    dueño = modelos_prueba.find_all('div',{'class':'productArtist'})[0].text[3:]
    precio = modelos_prueba.find_all('div',{'class':'priceSection price'})[0].text
    licencia = modelos_prueba.find_all('div',{'class':'LicenseUses'})[0].text
    fecha_pub = modelos_prueba.find_all('table',{'class':'SpecificationTable'})[0].select('time')[0]['datetime']
    formatos = modelos_prueba.find_all('table',{'class':'exchange'})[0].text
    categorias = modelos_prueba.find_all('div',{'class':'categorySection'})[0].select('p')[0].text
    tags = modelos_prueba.find_all('div',{'class':'tagSection'})[0].select('a')
    descripcion = modelos_prueba.find_all('div',{'class':'descriptionSection'})[0].select('.descriptionContentParagraph')[0].text
    row_dataset = [nombre,dueño,precio,licencia,fecha_pub,formatos,categorias,tags,descripcion]
    return row_dataset

In [119]:
urls_modelos_pages

[['https://www.turbosquid.com/3d-models/3d-fan-1427865',
  'https://www.turbosquid.com/3d-models/stool-3d-model-1427816',
  'https://www.turbosquid.com/3d-models/indoors-test-3d-model-1427813',
  'https://www.turbosquid.com/3d-models/3d-model-apartment-floor-1427811',
  'https://www.turbosquid.com/3d-models/3d-ceramic-coffee-cup-1427808',
  'https://www.turbosquid.com/3d-models/man-head-3d-model-1427729',
  'https://www.turbosquid.com/3d-models/blue-car-model-1427723',
  'https://www.turbosquid.com/3d-models/architecture-test-3d-model-1427708',
  'https://www.turbosquid.com/3d-models/es-studio-3d-model-1427682',
  'https://www.turbosquid.com/3d-models/3d-lighting-1427424',
  'https://www.turbosquid.com/3d-models/office-chair-3d-1427363',
  'https://www.turbosquid.com/3d-models/3d-model-urn-marble-concrete-1427284',
  'https://www.turbosquid.com/3d-models/3d-model-unwrapped-1426378',
  'https://www.turbosquid.com/3d-models/3d-metal-containers-1427125',
  'https://www.turbosquid.com/3d-m

In [117]:
SLEEP_INTERVAL = 0.5

# Instantiate the IronhackSpider class
project_spider_models = ModelsSpider(urls_modelos_pages,SLEEP_INTERVAL, content_parser = model_parser)

# Start scraping jobs
rows_dataset = project_spider_models.kickstart()

Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregaron 9 url a la lista
Se agregar

In [118]:
rows_dataset

[['3D Fan',
  'Akumax Maxime',
  '\nFree\n',
  ' - All Extended Uses',
  '2019-07-19',
  '\n\n\nOBJ \n\n',
  '',
  [<a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/fan">Fan</a>,
   <a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/ventilateur">ventilateur</a>],
  'Vintage fan'],
 ['3D Fan',
  'Akumax Maxime',
  '\nFree\n',
  ' - All Extended Uses',
  '2019-07-19',
  '\n\n\nOBJ \n\n',
  '',
  [<a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/fan">Fan</a>,
   <a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/ventilateur">ventilateur</a>],
  'Vintage fan'],
 ['3D Fan',
  'Akumax Maxime',
  '\nFree\n',
  ' - All Extended Uses',
  '2019-07-19',
  '\n\n\nOBJ \n\n',
  '',
  [<a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/fan">Fan</a>,
   <a class="FPKeywords" href="https://www.turbosquid.com/Search/3D-Models/ventilateur">ventilateur</a>],
  'Vintage fan'],
 ['3D Fan',
  'Akumax 