In [None]:
# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities (NEs) with a simple heuristic model, and shows them sorted by frequency of appearance.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, go to Help -> About Scala Kernel.

## 1. Get the text

### 1.1 Import libraries

In [1]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`

[32mimport [39m[36m$ivy.$                              [39m

In [2]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`

[32mimport [39m[36m$ivy.$                              
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
[39m
[32mimport [39m[36m$ivy.$                                        [39m

In [3]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML

[32mimport [39m[36mscalaj.http.{Http, HttpResponse}
[39m
[32mimport [39m[36mscala.xml.XML[39m

### 1.2 Get the text of the RSS feed

We make an HTTP request, which returns an `HttpResponse` instance. The `body` attribute of the `HttpResponse` holds the feed text in XML format. The XML is then parsed to extract the `title` and `description` fields of each item.

In [None]:
class RSS(){
    // Fetch the feed from a URL and return one text (title + description) per item
    def getRSSText (url: String): Seq[String] ={
        val response: HttpResponse[String] = Http(url)
          .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
          .asString
        val xmlString = response.body
        // convert the `String` to a `scala.xml.Elem`
        val xml = XML.loadString(xmlString)
        // Extract text from title and description
        (xml \\ "item").map { item => ((item \ "title").text ++ " " ++ (item \ "description").text) }
    }
}
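
To see what the parsing step does without making a network call, here is a minimal sketch that applies the same `\\` and `\` selectors to an inline feed. The feed content and the names `sampleFeed` and `sampleTexts` are made up for illustration; the sketch assumes `scala-xml` is on the classpath, as imported above.

```scala
import scala.xml.XML

// Hypothetical inline feed, so the example needs no network access.
val sampleFeed =
  """<rss version="2.0">
    |  <channel>
    |    <title>Sample feed</title>
    |    <item>
    |      <title>Cubs sweep Dodgers</title>
    |      <description>Three-game series at Wrigley Field.</description>
    |    </item>
    |    <item>
    |      <title>Bears draft a QB</title>
    |      <description>Chicago trades up.</description>
    |    </item>
    |  </channel>
    |</rss>""".stripMargin

val feed = XML.loadString(sampleFeed)

// `\\ "item"` searches all descendants; `\ "title"` looks only at direct
// children, so the channel-level <title> is not picked up.
val sampleTexts: Seq[String] = (feed \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}

sampleTexts.foreach(println)
```

Each element of `sampleTexts` is the title and description of one item joined by a space, which is the same shape that `getRSSText` returns.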

In [7]:
val rss = new RSS

[36mrss[39m: [32mRSS[39m = ammonite.$sess.cmd5$Helper$RSS@2c73a5d2

In [8]:
val rssText = rss.getRSSText("https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc")

[36mrssText[39m: [32mSeq[39m[[32mString[39m] = [33mList[39m(
  [32m"Northwestern faculty say they\u2019re \u2018alarmed\u2019 and \u2018embarrassed\u2019 by athletic director Mike Polisky\u2019s hiring and plan to protest at President Morton Schapiro\u2019s house Six female Northwestern faculty members sent an open letter Wednesday to Provost Kathleen Hagerty demanding greater transparency about the hiring of athletic director Mike Polisky, and they\u2019re planning a picket Friday that will march from campus to President Morton Schapiro\u2019s home to express opposition to the hiring."[39m,
  [32m"3 takeaways from the Chicago Cubs\u2019 series sweep of the Los Angeles Dodgers, including David Ross\u2019 savvy and Javier B\u00e1ez stepping up in big moments The Chicago Cubs showed their resiliency in sweeping the Los Angeles Dodgers in a three-game series at Wrigley Field. \u201cHonestly, we\u2019re riding a roller coaster right now,\" said Anthony Rizzo, whose RBI single in

## 2. Detect the named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which receives a list of texts.
Each text is split into words on spaces, and a word is considered a named entity if it starts with an uppercase letter.

The code below lists the punctuation marks and some common English words that are removed from the text.

In [12]:
class NERModel() {
    // Values needed to build the model
    val STOPWORDS = Seq (
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
        "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
        "their", "theirs", "themselves", "what", "which", "who", "whom",
        "this", "that", "these", "those", "am", "is", "are", "was", "were",
        "be", "been", "being", "have", "has", "had", "having", "do", "does",
        "did", "doing", "a", "an", "the", "and", "but", "if", "or",
        "because", "as", "until", "while", "of", "at", "by", "for", "with",
        "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out",
        "off", "over", "under", "again", "further", "then", "once", "here",
        "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not",
        "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
        "will", "just", "don", "should", "now", "on",
        // Contractions without ' (kept lowercase, since words are lowercased before the lookup)
        "im", "ive", "id", "youre", "youd", "youve",
        "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
        "theyre", "theyd", "theyve",
        "shouldnt", "couldnt", "mustnt", "cant", "wont",
        // Common uppercase words
        "hi", "hello"
    )
    val punctuationSymbols = ".,()!?;:'`´\n"
    val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")
    
    // Apply the model to the data (this simply maps the function over the list of texts)
    def getNEsSingle(text: String): Seq[String] =
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq

    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)
    
    // Count the entities and sort them by descending frequency
    def countandsort(textL : Seq[Seq[String]]): List[(String, Int)] = {
        val counts: Map[String, Int] = textL.flatten
          .foldLeft(Map.empty[String, Int]) {
             (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
        counts.toList
          .sortBy(_._2)(Ordering[Int].reverse)
    }
}
       

defined [32mclass[39m [36mNERModel[39m

In [13]:
val model = new NERModel

[36mmodel[39m: [32mNERModel[39m = ammonite.$sess.cmd11$Helper$NERModel@4caf1176

### 2.2 Apply the "model" to the data

In [14]:
val result = model.getNEs(rssText)

[36mresult[39m: [32mSeq[39m[[32mSeq[39m[[32mString[39m]] = [33mList[39m(
  [33mArraySeq[39m(
    [32m"Northwestern"[39m,
    [32m"Mike"[39m,
    [32m"Polisky\u2019s"[39m,
    [32m"President"[39m,
    [32m"Morton"[39m,
    [32m"Schapiro\u2019s"[39m,
    [32m"Six"[39m,
    [32m"Northwestern"[39m,
    [32m"Wednesday"[39m,
    [32m"Provost"[39m,
    [32m"Kathleen"[39m,
    [32m"Hagerty"[39m,
    [32m"Mike"[39m,
    [32m"Polisky"[39m,
    [32m"Friday"[39m,
    [32m"President"[39m,
    [32m"Morton"[39m,
    [32m"Schapiro\u2019s"[39m
  ),
  [33mArraySeq[39m(
    [32m"Chicago"[39m,
    [32m"Cubs\u2019"[39m,
    [32m"Los"[39m,
    [32m"Angeles"[39m,
    [32m"Dodgers"[39m,
    [32m"David"[39m,
    [32m"Ross\u2019"[39m,
    [32m"Javier"[39m,
    [32m"B\u00e1ez"[39m,
    [32m"Chicago"[39m,
    [32m"Cubs"[39m,
    [32m"Los"[39m,
    [32m"Angeles"[39m,
    [32m"Dodgers"[39m,
    [32m"Wrigley"[39m,
    [32m"Field"[39m,


## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
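
The three steps can be sketched on a small input as follows. This is an alternative formulation using `groupBy` instead of the `foldLeft` in `countandsort`; the names `demoResult` and `demoCounts` and the sample data are hypothetical, and the tie-break on the word itself is an addition to make the order deterministic.

```scala
// Hypothetical per-article entity lists, standing in for the output of getNEs.
val demoResult: Seq[Seq[String]] = Seq(
  Seq("Chicago", "Cubs"),
  Seq("Chicago", "Bears"),
  Seq("Chicago", "Cubs")
)

// Flatten, group equal words, count each group, then sort by descending count.
// The word itself is used as a tie-breaker so the order is deterministic.
val demoCounts: List[(String, Int)] =
  demoResult.flatten
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }
    .toList
    .sortBy { case (word, count) => (-count, word) }

demoCounts.foreach { case (word, count) => println(s"$word: $count") }
```

With this input, "Chicago" appears three times, "Cubs" twice and "Bears" once, so the list comes out in that order.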

In [15]:
val CountAndSortedNEs = model.countandsort(result)

CountAndSortedNEs: List[(String, Int)] = List(
  ("Chicago", 41),
  ("Bears", 12),
  ("Los", 11),
  ("Angeles", 11),
  ("Cubs", 11),
  ("Dodgers", 11),
  ("White", 7),
  ("Sox", 7),
  ("Reds", 6),
  ("Cincinnati", 6),
  ("Field", 6),
  ("Justin", 6),
  ("May", 5),
  ("Tuesday", 5),
  ("Tony", 5),
  ("Kyle", 4),
  ("Game", 4),
  ("COVID-19", 4),
  ("Photos", 4),
  ("Wrigley", 4),
  ("Column", 3),
  ("Houston", 3),
  ("Cubs\u2019", 3),
  ("QB", 3),
  ("Wednesday"