# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects named entities (NEs) with a simple heuristic model, and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, go to Help -> About Scala Kernel.

## 1. Fetch the text

### 1.1 Import libraries

In [1]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`


In [2]:
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`


In [5]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML


In [4]:
import $ivy.`org.json4s::json4s-jackson:3.4.0`
import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats

formats: DefaultFormats.type = org.json4s.DefaultFormats$@6f7917c2

### 1.2 Fetch the text from the RSS feed

We issue an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the text of the feed. For an RSS feed the body is XML, which is parsed to extract the `title` and `description` fields of each item; for the Reddit URL the body is JSON, from which the `title` and `selftext` fields are extracted.
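As a minimal sketch of the XML step on its own, the snippet below parses a small in-memory feed instead of an HTTP response. The feed contents and the names `sampleFeed` and `texts` are hypothetical and only serve as illustration; it assumes `scala-xml` is on the classpath, as in the imports above.

```scala
import scala.xml.XML

// Hypothetical two-item feed, used only for illustration.
val sampleFeed =
  """<rss version="2.0">
    |  <channel>
    |    <title>Example feed</title>
    |    <item><title>First title</title><description>First body</description></item>
    |    <item><title>Second title</title><description>Second body</description></item>
    |  </channel>
    |</rss>""".stripMargin

// `\\` searches all descendants of the root; `\` searches direct children only,
// so the channel-level <title> is not picked up.
val texts: Seq[String] = (XML.loadString(sampleFeed) \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}
// texts == Seq("First title First body", "Second title Second body")
```

Each item yields one string, which is the shape the later steps expect.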

In [6]:
val url1 = "https://www.reddit.com/r/Android/hot/.json?count=10"
val url2 = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

url1: String = "https://www.reddit.com/r/Android/hot/.json?count=10"
url2: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"

In [9]:
class GetTextURL() {
    // Fetch the text of a feed from a URL. `urlType` selects the parser: "rss" or "reddit".
    def queryURL(url: String, urlType: String): Seq[String] = {
        try {
            val response: HttpResponse[String] = Http(url)
              .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
              .asString
            val stringBody = response.body
            urlType match {
                case "rss" => {
                    val xml = XML.loadString(stringBody)
                    // Extract the text of the title and description of each item
                    (xml \\ "item").map { item => ((item \ "title").text + " " + (item \ "description").text) }
                }
                case "reddit" => {
                    // Parse the Reddit feed (JSON) into one key/value map per post
                    val result = (parse(stringBody) \ "data" \ "children" \ "data")
                         .extract[List[Map[String, Any]]]
                    // Keep only the title and selftext fields
                    val filterContent = result.flatten.filter{case (v , _) => v == "title" || v == "selftext" }.map(x => x._2.toString)
                    // Strip URLs from the text
                    val pattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]".r
                    filterContent.map(x => pattern.replaceAllIn(x,"")).toSeq
                }
            }
        } catch {
            // Any failure (network error, malformed body, unknown urlType) yields an empty result
            case e: Exception => List()
        }
    }
}

defined class GetTextURL

In [10]:
val text = new GetTextURL

text: GetTextURL = ammonite.$sess.cmd8$Helper$GetTextURL@110f17ec

In [23]:
val redditText = text.queryURL(url1,"reddit")
val rssText = text.queryURL(url2, "rss")

redditText: Seq[String] = List(
  """Note 1. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]()

This weekly Sunday thread is for you to let off some steam and speak out about whatever complaint you might have about:  

* Your device.  

* Your carrier.  

* Your device's manufacturer.  

* An app  

* Any other company

***  

**Rules**  

1) Please do not target any individuals or try to name/shame any individual. If you hate Google/Samsung/HTC etc. for one thing that is fine, but do not be rude to an individual app developer.

2) If you have a suggestion to solve another user's issue, please leave a comment but be sure it's constructive! We do not want any flame-wars.  

3) Be respectful of other's opinions. Even if you feel that somebody is "wrong" you don't have to go out of your way to prove them wrong. Disagree politely, and move on.""",
  "Sunday Rant/Rage (May 09 2021) - Your weekly complaint thread!

## 2. Detect named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which takes a list of texts.
Each text is split into words on spaces, and a word is treated as a named entity if it starts with an uppercase letter, is longer than one character, and is not a stopword.
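To make the heuristic concrete, here is a minimal sketch applied to a single sentence. The sentence, the shortened stopword set, and the names `sampleStopwords`, `sampleText`, and `candidates` are hypothetical; the notebook's own class uses a much longer stopword list and a different punctuation pattern.

```scala
// Hypothetical, shortened stopword set and sample sentence, for illustration only.
val sampleStopwords = Set("the", "and", "this", "hello")
val sampleText = "Hello, this phone runs Android and the Samsung skin."

val candidates: List[String] = sampleText
  .replaceAll("[.,!?;:]", "")   // drop a few punctuation marks
  .split(" ")                   // split on single spaces
  .filter { word =>
    word.length > 1 &&
    Character.isUpperCase(word.charAt(0)) &&
    !sampleStopwords.contains(word.toLowerCase)
  }
  .toList
// candidates == List("Android", "Samsung")
```

"Hello" starts with an uppercase letter but is dropped because its lowercase form is in the stopword set. That is why the stopword list matters for a capitalization-only heuristic: sentence-initial words would otherwise count as entities.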

The following cell lists the punctuation marks and common English words that are removed from the text.

In [24]:
class NERModel() {
    // Values needed to build the model
    val STOPWORDS = Seq (
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
        "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
        "their", "theirs", "themselves", "what", "which", "who", "whom",
        "this", "that", "these", "those", "am", "is", "are", "was", "were",
        "be", "been", "being", "have", "has", "had", "having", "do", "does",
        "did", "doing", "a", "an", "the", "and", "but", "if", "or",
        "because", "as", "until", "while", "of", "at", "by", "for", "with",
        "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out",
        "off", "over", "under", "again", "further", "then", "once", "here",
        "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not",
        "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
        "will", "just", "don", "should", "now", "on",
        // Contractions without the apostrophe
        "im", "ive", "id", "youre", "youd", "youve",
        "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
        "theyre", "theyd", "theyve",
        "shouldnt", "couldnt", "mustnt", "cant", "wont",
        // Commonly capitalized words
        "hi", "hello"
    )
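    val punctuationSymbols = ".,()!?;:'`´\n"
    // Alternation that matches any single punctuation symbol, each one quoted literally.
    // Matches are deleted, not replaced by a space, so words separated only by "\n" get joined.
    val punctuationRegex = punctuationSymbols.toSeq.map(c => java.util.regex.Pattern.quote(c.toString)).mkString("|")
    
    // Apply the model to the data (this just maps the function over the list of texts)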
    def getNEsSingle(text: String): Seq[String] =
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq

    def getNEs(textList: Seq[String]): Seq[Seq[String]] = textList.map(getNEsSingle)
    
    // Count and sort the entities
    def countandsort(textL : Seq[Seq[String]]): List[(String, Int)] = {
        val counts: Map[String, Int] = textL.flatten
          .foldLeft(Map.empty[String, Int]) {
             (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
        counts.toList
          .sortBy(_._2)(Ordering[Int].reverse)
    }
}
       

defined class NERModel

In [25]:
val model = new NERModel

model: NERModel = ammonite.$sess.cmd23$Helper$NERModel@7a38d15f

### 2.2 Apply the "model" to the data

In [26]:
val result = model.getNEs(redditText)

result: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Note",
    "Join",
    "IRC",
    "Telegram",
    "Sunday",
    "Please",
    "Google/Samsung/HTC",
    "Even",
    "Disagree"
  ),
  ArrayBuffer("Sunday", "Rant/Rage", "May"),
  ArrayBuffer(
    "Hey",
    "Participation",
    "Google",
    "Forms",
    "Email",
    "Responses",
    "Well",
    "POLL]Edit"
  ),
  ArrayBuffer("Community", "Feedback", "Poll", "February"),
  ArrayBuffer(),
  ArrayBuffer("Xiaomi", "Mi", "Mix", "Folds", "PC", "Mode"),
  ArrayBuffer(),
  ArrayBuffer(
    "PSA",
    "Qu

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
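The three steps can be sketched on a small input as follows. The data and the names `sampleEntities` and `frequencies` are hypothetical. This variant uses `groupBy` rather than the fold in `countandsort`, and it breaks ties alphabetically, which the notebook's method does not do.

```scala
// Hypothetical per-text entity lists, for illustration only.
val sampleEntities: Seq[Seq[String]] = Seq(
  Seq("Google", "Android"),
  Seq("Android", "Samsung"),
  Seq("Android", "Google")
)

val frequencies: List[(String, Int)] = sampleEntities.flatten   // concatenate all lists
  .groupBy(identity)                                            // entity -> its occurrences
  .map { case (entity, hits) => (entity, hits.size) }           // entity -> count
  .toList
  .sortBy { case (entity, count) => (-count, entity) }          // count descending, then name
// frequencies == List(("Android", 3), ("Google", 2), ("Samsung", 1))
```

Sorting on the pair `(-count, entity)` keeps the order deterministic when several entities share the same count.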

In [27]:
val CountAndSortedNEs = model.countandsort(result)

CountAndSortedNEs: List[(String, Int)] = List(
  ("GoPro", 18),
  ("Plus", 9),
  ("Android", 8),
  ("Google", 6),
  ("ADB", 5),
  ("OS", 5),
  ("Windows", 5),
  ("HDR", 4),
  ("Samsung", 4),
  ("MIUI", 4),
  ("Note", 3),
  ("Fastboot\"", 3),
  ("App", 3),
  ("Pro", 3),
  ("May", 3),
  ("Join", 3),
  ("AI", 3),
  ("Photos", 3),
  ("However", 3),
  ("Users", 2),
  ("Xperia", 2),
  ("PSA", 2),
  ("Galaxy", 2),
  ("PC", 2),
  ("Hey", 2),
  ("SDK