<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Challenge No. 1

Valentín Pertierra

#### Assignment:

**1**. Vectorize documents. Pick 5 documents at random and measure their similarity to the rest of the documents.
Study the 5 most similar documents for each one and analyze whether the similarity makes sense
given the text content and the classification label.

**2**. Train Naïve Bayes classification models to maximize classification performance
(macro F1-score) on the test set. Consider changing the instantiation parameters
of the vectorizer and the models, and try both Multinomial Naïve Bayes
and ComplementNB models.

**3**. Transpose the document-term matrix. This yields a
term-document matrix that can be interpreted as a collection of word vectors.
Then study similarity between words by taking 5 words and examining the 5 most similar to each. **The words must not be chosen at random, to avoid hard-to-interpret terms; pick them "manually"**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

from sklearn.datasets import fetch_20newsgroups
import numpy as np

from IPython.display import display, Markdown
import pandas as pd

### Data loading

In [None]:
# load the data (already split into train and test by default)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

### 1) Vectorization

In [None]:
# Instantiate a vectorizer
# see the sklearn documentation for the available instantiation parameters
tfidfvect = TfidfVectorizer()
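
To give an idea of what changing the instantiation parameters does, here is a minimal sketch on a made-up corpus. The specific values (`stop_words="english"`, `min_df=2`, `ngram_range=(1, 2)`) are assumptions chosen for illustration, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, only for illustration)
toy_docs = [
    "the car engine is loud",
    "the car has a new engine",
    "the hockey team won the game",
    "the team lost the hockey game",
]

# Default settings: every token of 2+ characters becomes a feature
default_vect = TfidfVectorizer()
default_vect.fit(toy_docs)

# Alternative settings: drop English stop words, keep only terms that
# appear in at least 2 documents, and add bigrams on top of unigrams
tuned_vect = TfidfVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
tuned_vect.fit(toy_docs)

print(sorted(default_vect.vocabulary_))
print(sorted(tuned_vect.vocabulary_))
```

Filtering rare terms and stop words shrinks the vocabulary, which on the real dataset would reduce the dimensionality of the document vectors.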

In [None]:
# the `data` attribute gives access to the text
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [None]:
# with the usual sklearn interface we can fit the vectorizer
# (learn the vocabulary and compute the IDF vector)
# and transform the data in a single step
X_train = tfidfvect.fit_transform(newsgroups_train.data)
# `X_train` is what we call the document-term matrix

In [None]:
# recall that count-based vectorizations are sparse,
# so sklearn conveniently returns the document vectors
# as sparse matrices
print(type(X_train))
print(f'shape: {X_train.shape}')
print(f'number of documents: {X_train.shape[0]}')
print(f'vocabulary size (dimensionality of the vectors): {X_train.shape[1]}')

<class 'scipy.sparse._csr.csr_matrix'>
shape: (11314, 101631)
number of documents: 11314
vocabulary size (dimensionality of the vectors): 101631
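
How sparse such a matrix is can be measured directly from the number of stored entries. The following is a small sketch on a made-up corpus; the names `X_toy` and `density` are introduced only for this example.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, only for illustration)
toy_docs = [
    "the car engine is loud",
    "the hockey team won the game",
    "the rocket launch was successful",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

n_docs, n_terms = X_toy.shape
# `nnz` is the number of explicitly stored values in the sparse matrix
density = X_toy.nnz / (n_docs * n_terms)

print(f"sparse: {issparse(X_toy)}")
print(f"shape: {X_toy.shape}, stored values: {X_toy.nnz}, density: {density:.2%}")
```

The same two lines applied to `X_train` would show how small a fraction of the 11314 x 101631 cells is actually stored.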


In [None]:
# once the vectorizer is fitted, we can access attributes such as the learned
# vocabulary. It is a dictionary that maps terms to indices.
# The index is the term's position in the document vector.
tfidfvect.vocabulary_['car']

25775

In [None]:
# it is very useful to have the inverse dictionary, from indices to terms
idx2word = {v: k for k,v in tfidfvect.vocabulary_.items()}

In [None]:
# `y_train` stores the targets, which are integers
y_train = newsgroups_train.target
y_train[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [None]:
# there are 20 classes, one for each of the 20 newsgroups
print(f'classes {np.unique(newsgroups_test.target)}')
newsgroups_test.target_names

classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Document similarity

Pick 5 documents at random and measure their similarity to the rest of the documents. Study the 5 most similar documents for each one and analyze whether the similarity makes sense given the text content and the classification label.

In [None]:
# Select the documents to analyze
idxs = [450, 1300, 4200, 7850, 9900]

resultados = []
textos = []

for idx in idxs:

  # Compute the cosine similarity against the whole dataset and keep the 5 most similar documents
  cossim = cosine_similarity(X_train[idx], X_train)[0]

  # Position 0 of the descending ranking is skipped: it is expected to be the document itself
  simIdx = np.argsort(cossim)[::-1][1:6]
  cossimVal = cossim[simIdx]

  # Store the selected text followed by the texts of its 5 neighbours
  textos.append([newsgroups_train.data[idx]] + [newsgroups_train.data[i] for i in simIdx])

  # Store the results for comparison
  resultados.append([newsgroups_train.target_names[y_train[idx]]] +
                    [f'Idx:{i} cossim:{round(v, 2)} {newsgroups_train.target_names[y_train[i]]}'
                     for i, v in zip(simIdx, cossimVal)])
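
Two details of this ranking can be checked on a small example. First, `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so cosine similarity coincides with the plain dot product computed by `linear_kernel`. Second, the query document can be excluded by index instead of relying on it being ranked first. The corpus below is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical, only for illustration)
toy_docs = [
    "my pc clock loses time overnight",
    "the dos clock does not roll the date overnight",
    "the jets play the flames tonight",
    "the flames signed a deal with the city",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

query = 0
cos = cosine_similarity(X_toy[query], X_toy)[0]
dot = linear_kernel(X_toy[query], X_toy)[0]   # dot product, no normalization step

# Exclude the query by index rather than assuming it sorts first
order = np.argsort(cos)[::-1]
neighbours = [i for i in order if i != query][:2]

print(np.allclose(cos, dot))
print(neighbours, cos[neighbours])
```

Filtering by index also covers the corner case where another document is an exact duplicate of the query and ties with it at similarity 1.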


In [None]:
# Show the results in a table
df = pd.DataFrame(
    resultados,
    columns=['Target','Similar document 1','Similar document 2','Similar document 3','Similar document 4', 'Similar document 5'],
    index=idxs)

display(df)


| | Target | Similar document 1 | Similar document 2 | Similar document 3 | Similar document 4 | Similar document 5 |
|---|---|---|---|---|---|---|
| 450 | comp.sys.ibm.pc.hardware | Idx:2075 cossim:0.43 comp.sys.ibm.pc.hardware | Idx:943 cossim:0.31 comp.sys.ibm.pc.hardware | Idx:1339 cossim:0.28 comp.sys.ibm.pc.hardware | Idx:7591 cossim:0.27 comp.sys.ibm.pc.hardware | Idx:685 cossim:0.25 comp.sys.ibm.pc.hardware |
| 1300 | sci.med | Idx:8550 cossim:0.54 sci.med | Idx:8660 cossim:0.51 sci.med | Idx:2189 cossim:0.46 sci.med | Idx:3808 cossim:0.46 sci.med | Idx:3652 cossim:0.45 sci.med |
| 4200 | rec.sport.hockey | Idx:6859 cossim:0.7 rec.sport.hockey | Idx:3524 cossim:0.33 rec.sport.hockey | Idx:9635 cossim:0.32 rec.sport.hockey | Idx:9625 cossim:0.29 rec.sport.hockey | Idx:3313 cossim:0.26 rec.sport.hockey |
| 7850 | sci.crypt | Idx:6935 cossim:0.34 sci.crypt | Idx:4498 cossim:0.31 sci.crypt | Idx:9007 cossim:0.31 sci.crypt | Idx:8445 cossim:0.3 sci.crypt | Idx:5612 cossim:0.3 sci.crypt |
| 9900 | sci.space | Idx:1761 cossim:0.35 sci.space | Idx:9986 cossim:0.34 sci.space | Idx:7545 cossim:0.32 sci.space | Idx:3285 cossim:0.32 sci.space | Idx:11198 cossim:0.3 sci.space |


The table shows that, in every case, the selected document and its 5 most similar documents belong to the same class.
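
The same check can be expressed as a number: for each document, the fraction of its k nearest neighbours that share its label. Below is a minimal sketch on a made-up labelled corpus; `toy_labels`, `k` and `agreement` are names introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy labelled corpus (hypothetical): 0 = hardware, 1 = hockey
toy_docs = [
    "the clock on my pc loses time",
    "my pc clock is slow after midnight",
    "the bios clock resets the date",
    "the jets beat the flames in hockey",
    "the flames lost the hockey game",
    "the hockey playoffs start tonight",
]
toy_labels = np.array([0, 0, 0, 1, 1, 1])

X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

sim = cosine_similarity(X_toy)       # dense (n_docs, n_docs) matrix
np.fill_diagonal(sim, -1.0)          # a document is never its own neighbour

k = 2
neighbours = np.argsort(sim, axis=1)[:, ::-1][:, :k]

# For each document, fraction of its k neighbours with the same label
same_label = toy_labels[neighbours] == toy_labels[:, None]
agreement = same_label.mean(axis=1)

print(agreement, agreement.mean())
```

Computing the full similarity matrix is fine for a handful of documents; for the whole training set it would be an 11314 x 11314 dense array, so a per-query loop like the one in the notebook is the lighter option.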

### Comparing the document texts
The texts are printed in separate cells so that they can be collapsed, which keeps the notebook tidier.

In [None]:
display(Markdown(f'**Selected document:** {idxs[0]}\n'))
print(textos[0][0])

for i in range(len(textos[0])-1):
  display(Markdown(f'**Similar document {i+1}:**\n'))
  print(textos[0][i+1])
  print('\n')

**Selected document:** 450








In addition to startup time, I leave things running because my PC doubles as 
a fax machine. 

However, this is off the original subject. I didn't get the replies on BIOS, 
CMOS, and DOS clock/date logic. All I know is that I've been running this way 
for many months and it is only recently, the last month, that I have noticed 
the intermittent clock problem. As I stated, it is not always the date that 
doesn't roll forward, sometimes I notice that the clock is several minutes 
behind where it ought to be. 

When unattended, the following are generally running minimized in Win 3.1:

Clock, WinFax Pro 3.0, Print Manager, MS-Word 1.1, File Manager, Program 
Manager

A random screen saver is generally running too.




**Similar document 1:**








I've started to notice the same thing myself. I'm running DOS 5 and Win 3.1 so
I can fix it from the Windows Control Panel. At times it is the date, at
others the clock seems to be running several minutes behind where it should
be.

If you find out I'd like to know also. Oh, and I also leave my system running
all the time.
                                                                    




**Similar document 2:**



I bet it suddenly started sticking when you  started leaving the PC running the
menu all night.  There is a limitation/bug in the date roll-over software in
PC's that means you have to be doing something like waiting for keyboard input
via a DOS call rather than a BIOS call (as menus often use) otherwise the code
to update the date after midnight never gets called. 

Somebody might be able to correct the details in case I've mis-rememberred
them, but I think you have to change the menu program (if you have the sources)
or add a TSR or system patch or something.  As far as I know the CMOS clock
keeps the right time (in fact about 7 seconds/day better than DOS's clock).




**Similar document 3:**








Did I once hear that in order for the date to advance, something, like a 
clock, *has* to make a Get Date system call? Apparently, the clock
hardware interrupt and BIOS don't do this (date advance) automatically. The
Get Date call notices that a "midnight reset" flag has been set, and then
then advances the date.

Anybody with more info?




**Similar document 4:**


Anybody seen the date get stuck?

I'm running MS-DOS 5.0 with a menu system alive all the time.  The machine
is left running all the time.

Suddenly, the date no longer rolls over.  The time is (reasonably) accurate
allways, but we have to change the date by hand every morning.  This involves
exiting the menu system to get to DOS.

Anyone have the slightest idea why this should be?  Even a clue as to whether
the hardware (battery? CMOS?) or DOS is broken?




**Similar document 5:**


  [stuff deleted]

There are two 'problems':
(1) the BIOS TOD routine which updates the BIOS clock uses only 1 bit
    for day increment, so a second wrapping of the clock past midnight
    will get lost if no one calls the BIOS to read the clock in the
    meantime, and
(2) the BIOS resets the day wrap indicator on the first 'get date'
    call from ANYBODY (after the wrap indicator has been set). So
    unless the first BIOS 'get date' call after midnight is done by
    the DOS 'kernel' (which is the only part of DOS which knows how to
    increment the date, the day wrap indication is normally lost.
My guess is that Kevin's 'menu' system uses BIOS calls to read the
clock (in order to display the time), and is hence the entity which
causes the day wrap indication to get lost. Even if the 'menu' system
'notices' the day 'wrap' (which I think is indicated by a non-zero
value in AL), there really isn't any particularly good way to tell DOS
about it, so that DOS can update the day. The m

In [None]:
display(Markdown(f'**Selected document:** {idxs[1]}\n'))
print(textos[1][0])

for i in range(len(textos[1])-1):
  display(Markdown(f'**Similar document {i+1}:**\n'))
  print(textos[1][i+1])
  print('\n')

**Selected document:** 1300




The flushing is due to vascular dilation, part of a migraine attack.
Some people event get puffy and swollen.  As long as you are careful
you can see well enough to avoid getting hit in the face or eye by
the ball, migraine will not hurt your health.



-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 


**Similar document 1:**




So just what was it you wanted to say?



-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 




**Similar document 2:**




By law, they would not be allowed to do that anyhow.




-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 




**Similar document 3:**



Senile keratoses.  Have nothing to do with the liver.


-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 




**Similar document 4:**



There is eye dominance same as handedness (and usually for the
same side).  It has nothing to do with refractive error, however.


-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 




**Similar document 5:**




"Diet Evangelist".  Good term.  Fits Atkins to a "T".  


-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 




In [None]:
display(Markdown(f'**Selected document:** {idxs[2]}\n'))
print(textos[2][0])

for i in range(len(textos[2])-1):
  display(Markdown(f'**Similar document {i+1}:**\n'))
  print(textos[2][i+1])
  print('\n')

**Selected document:** 4200


DALLAS HELPS HAWKS STAY IN MONCTON
After announcing that they would pull their affiliation out
of Moncton, the Winnipeg Jets changed their mind. 

The Jets announced the move when they said that they would be slashing
their minor league roster from 20-something to around a dozen; and they
wanted to share with an existing AHL or IHL franchise.

Enter the Dallas Lone Stars. Dallas agreed to supply the remaining
6 or 8 players to the Moncton franchise. Thus keeping the Hawks
in the New Brunswick city.

The deal is for one year and will be extended to three years if
the season ticket base increases to over 3000. The Hawks only sold
1400 for this year.

SAINT JOHN FLAMES OFFICIAL
The Calgary Flames have officially signed a deal with the city of
Saint John, NB. The Saint John Blue Flames will play in the 6200
Exhibition Center. The Flames still have to apply for an expansion
frnachise from the AHL but are expected to have no trouble.

CAPS FOLLOW JACKS TO MAINE
Despite rumors to the contrary

**Similar document 1:**


Here is a review of some of the off-ice things that have
affected the AHL this year.


ST JOHN'S MAPLE LEAFS PROBLEMS
The St John's Maple Leafs sophomore season has been plagued by
problems. On-ice, the Leafs won the Atlantic Division title but
off ice was less happy. A strike by public workers has forced the
leafs out of the Newfoundland city for much of the last half of
the seaosn (since mid-Jan). They have played "home" games in places
like Montreal, Cornwall and Charlottetown. Their playoff "home"
games will be played in the Metro Center in Halifax, NS. One
demostration got violent. Workers attacked a Leafs' bus and
rocked it and broke windows in the St John's Memorial Stadium.
Despite the problems, Toronto officials insist that the Leafs
will return to St John's once the strike ends.
SENATORS SOLD
The New Haven Senators have been sold by Peter Shipman to
the Ottawa Senators NHL organization. They are the only Canadian
NHL team with an American AHL affiliate, and have made it clear

**Similar document 2:**


: >
: >ATLANTIC DIVISION
: >	
: >	ST JOHN'S MAPLE LEAFS VS MONCTON HAWKS
: >	MONCTON HAWKS
: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,
: >defensive, good goaltending. John Leblanc and Stu Barnes are the only
: >noticable guns on the team. But the defense is top notch and 
: >Mike O'Neill is the most underrated goalie in the league.
: >

: Bri, as I have tried to tell you since 2 February, Michael O'Neill
: might be the most underrated goalie in the AHL, but he ISN'T in the
: AHL.  He's on the Winnipeg Jets' injury list, as he has been since
: his first NHL start against the Ottawa Senators.  He's out until
: next year after surgery to repair a shoulder separation.

: Stu Barnes might be an AHL gun for the Hawks, but he's now the third
: line center with the Jets, and has been since mid January or so.

Sorry, my memory is gone. I thought that O'Neill got sent back
down in February but I must have been given incorrect info. I guess
this says it all about Monc

**Similar document 3:**


Archive-name: hockey-faq

rec.sport.hockey answers to Frequently Asked Questions and other news:
 
Contents:

0. New Info.
1. NHL
2. NHL Minor Leagues
3. College Hockey (North America)
4. Other leagues (e.g. Europe, Canada Cup tournament)
5. E-mail files
6. USENET Hockey Pool
7. Up-coming Dates
8. Answers to some frequently asked questions
9. Miscellaneous
 
 Send comments, suggestions and criticisms regarding this FAQ list via e-
mail to hamlet@u.washington.edu.
 
--------------------------------------------------------------------------
 
 0. New Info.
 
 This section will describe additions since the last post so that you can 
decide if there is anything worth reading. Paragraphs containing new 
information will be preceded by two asterisks (**).

 1.: New Anaheim contact, Winnipeg to keep affiliate in Moncton.
 2.: New Milwaukee contact, IHL broadcaster of the year named, Rheaume to 
start against Cyclones, San Diego sets record.
 3.: Ticket info included for 1994 NCAA Division I C

**Similar document 4:**



Gee, you'd think Winnipeg would be tops on that list, what with 8 regulars
being European.



Well, being a Jet fan, I sometimes wish that Bure would get knocked silly
too.  (Nothing serious, just enough to keep him out of a game. :)



In most cases, the owners have very little to do with it.  They give their
general managers one order when it comes to the draft...find me the best
players so that our team will win the Stanley Cup.  Whether that player is
in Kindersley, Saskatchewan or Chelyabinsk, Russia, if the GM believes him
to be the better player, the GM should be drafting him.

Where do you get off calling the NHL THEIR league, when referring to Canadian
players.  It doesn't belong to them, it belongs to the owners.  The owners
can do what they want.  While a 'Canadian content' rule might be enforcable
here in Canada, there is enough doubt that it would be enforcable in the US
that the CFL (sorry for the football reference) didn't even TRY to push their
import ratio rule on the

**Similar document 5:**


Smythe Division
---------------

Vancouver vs. Winnipeg - Jets in 7
The Jets have played the Canucks tough the last three games.  Everyone is
healthy for the Jets.  I'm biased.  :)

Calgary vs. Los Angeles - Flames in 6
From what I have seen, the Kings have looked flat lately.  I just can't see
them getting by the Flames.

Final- Jets in 6.
The Jets haven't lost to the Flames in '93.  They will, but it will be a
close series that will come down to how well Roberts has recovered.  I
don't think he'll be 100%, and while it will help, it won't be enough.

Norris Division
---------------

Chicago vs. St. Louis/Minnesota
Chicago in 6 against the Blues, 7 against the Stars.  

Detroit vs. Toronto - Wings in 6.
The Wings should be able to shutdown Gilmour and Andreychuk.  Chelvadae is
more experienced than Potvin.

Final - Hawks in 7.  Brutal series.  Probert and Chelios will go at it.
Belfour is better than Chelvadae, IMHO.

Conference Final - Hawks in 6.  It hurts, but the Hawks are more ex

In [None]:
display(Markdown(f'**Selected document:** {idxs[3]}\n'))
print(textos[3][0])

for i in range(len(textos[3])-1):
  display(Markdown(f'**Similar document {i+1}:**\n'))
  print(textos[3][i+1])
  print('\n')

**Selected document:** 7850


I have a question about digital communications encryption:

	The Fact Sheet mentioned encryption/decryption microcircuitry with 
special "keys" for law enforcement for wire tapping purposes.

	If I wanted to, couldn't I develop  encryption of my own?  That
is, if me and a partner in crime had unique Encryption/decryption
devices installed before the "tappable" one, couldn't we circumvent
the "keys" system?  Or replace it?

	I'd be really interested in knowing how the E/D microcircuits might
be made to prevent such befuddlement! (Laymans' Language, please! maybe a bit
technical...)

Please E-mail to me, as I'm not in Net News as much as I'd like to be!


Pete
deuelpm@craft.camp.clarkson.edu



**Similar document 1:**


Note:     The following was released by the White House today in
          conjunction with the announcement of the Clipper Chip
          encryption technology.

                           FACT SHEET

                  PUBLIC ENCRYPTION MANAGEMENT

The President has approved a directive on "Public Encryption
Management."  The directive provides for the following:

Advanced telecommunications and commercially available encryption
are part of a wave of new computer and communications technology. 
Encryption products scramble information to protect the privacy of
communications and data by preventing unauthorized access. 
Advanced telecommunications systems use digital technology to
rapidly and precisely handle a high volume of communications. 
These advanced telecommunications systems are integral to the
infrastructure needed to ensure economic competitiveness in the
information age.

Despite its benefits, new communications technology can also
frustrate lawful government electronic sur

**Similar document 2:**


This document is in the anonymous ftp directory at NIST.  Looks to me
like the other shoe has dropped.

	Jim Gillogly
	Trewesday, 25 Astron S.R. 1993, 17:00

-------------------

Note:  This file will also be available via anonymous file
transfer from csrc.ncsl.nist.gov in directory /pub/nistnews and
via the NIST Computer Security BBS at 301-948-5717.
     ---------------------------------------------------

                         THE WHITE HOUSE

                  Office of the Press Secretary

_________________________________________________________________

For Immediate Release                           April 16, 1993


                STATEMENT BY THE PRESS SECRETARY


The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.

The initiative will involve the creation of new products to
ac

**Similar document 3:**


I saw this article posted in a local newsgroup.  I haven't seen it,
or any followup traffic relating to it in these groups or other groups
which I subscribe to.  So, I am posting it here so others can read it,
check it out, and comment on it, and provide ideas for handling these
sorts of things.

I have no verification to the accuracy or lack of accuracy of this
article, but if accurate, I find it extremely disturbing, especially in
light of various abuses of the SSN number regarding privacy, (I understand
it is now to be required in CA to renew a drivers license, or to register
a car) and other proposals regarding 'smart' national Identity Cards,
wiretap proposals, and such.  One simply wonders what other gems are in
the wings ready to be sprung on the people by our government.  Perhaps
suggestions and ideas for preventing this and other such proposals from
acquiring the force of law would be useful.  The cost simply outweighs
any possible benefits, IMO.

BTW, reading this makes me th

**Similar document 4:**


It looks like Dorothy Denning's wrong-headed ideas have gotten to the
Administration even sooner than we feared. It's time to make sure they
hear the other side of the story, and hear it loudly!

Phil



------- Forwarded Message

Subject: text of White House announcement and Q&As on clipper chip encryption

Note:  This file will also be available via anonymous file
transfer from csrc.ncsl.nist.gov in directory /pub/nistnews and
via the NIST Computer Security BBS at 301-948-5717.
     ---------------------------------------------------

                         THE WHITE HOUSE

                  Office of the Press Secretary

_________________________________________________________________

For Immediate Release                           April 16, 1993


                STATEMENT BY THE PRESS SECRETARY


The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
co

**Similar document 5:**


Note:  This file will also be available via anonymous file
transfer from csrc.ncsl.nist.gov in directory /pub/nistnews and
via the NIST Computer Security BBS at 301-948-5717.
     ---------------------------------------------------

                         THE WHITE HOUSE

                  Office of the Press Secretary

_________________________________________________________________

For Immediate Release                           April 16, 1993


                STATEMENT BY THE PRESS SECRETARY


The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.

The initiative will involve the creation of new products to
accelerate the development and use of advanced and secure
telecommunications networks and wireless communications links.

For too long there has been little or no dialogue between o

In [None]:
display(Markdown(f'**Selected document:** {idxs[4]}\n'))
print(textos[4][0])

for i in range(len(textos[4])-1):
  display(Markdown(f'**Similar document {i+1}:**\n'))
  print(textos[4][i+1])
  print('\n')

**Selected document:** 9900


To All -- I thought the net would find this amusing..
  
From the March 1993 "Aero Vision" (The newsletter for the Employees
of McDonnell Douglas Aerospace at Huntington Beach, California).
  
  SPACE CLIPPERS LAUNCHED SUCCESSFULLY
  
  "On Monday, March 15 at noon, Quest Aerospace Education, Inc.
  launched two DC-Y Space Clippers in the mall near the cafeteria.
  The first rocket was launched by Dr. Bill Gaubatz, director and
  SSTO program manager, and the second by Air Force Captain Ed
  Spalding, who with Staff Sgt. Don Gisburne represents Air Force
  Space Command, which was requested by SDIO to assess the DC-X for
  potential military operational use.  Both rocket launches were
  successful.  The first floated to the ground between the cafeteria
  and Building 11, and the second landed on the roof of the
  cafeteria.
  
  Quest's Space Clipper is the first flying model rocket of the
  McDonnell Douglas DC-X.  The 1/122nd semi-scale model of the
  McDonnell Douglas Delta Clipper 

**Similar document 1:**


McDonnell Douglas rolls out DC-X

        HUNTINGTON BEACH, Calif. -- On a picture-perfect Southern
California day, McDonnell Douglas rolled out its DC-X rocket ship last
Saturday.  The company hopes this single-stage rocket technology
demonstrator will be the first step towards a single-stage-to-orbit (SSTO)
rocket ship.

        The white conical vehicle was scheduled to go to the White Sands
Missile Range in New Mexico this week.  Flight tests will start in
mid-June.

        Although there wasn't a cloud in the noonday sky, the forecast for
SSTO research remains cloudy.  The SDI Organization -- which paid $60
million for the DC-X -- can't itself afford to fund full development of a
follow-on vehicle.  To get the necessary hundreds of millions required for
a sub-orbital DC-XA, SDIO is passing a tin cup among its sister government
agencies.

        SDIO originally funded SSTO research as a way to cut the costs for
orbital deployments of space-based sensors and weapns.  However, rece

**Similar document 2:**


Archive-name: space/groups
Last-modified: $Date: 93/04/01 14:39:08 $

SPACE ACTIVIST/INTEREST/RESEARCH GROUPS AND SPACE PUBLICATIONS

    GROUPS

    AIA -- Aerospace Industry Association. Professional group, with primary
	membership of major aerospace firms. Headquartered in the DC area.
	Acts as the "voice of the aerospace industry" -- and it's opinions
	are usually backed up by reams of analyses and the reputations of
	the firms in AIA.

	    [address needed]

    AIAA -- American Institute of Aeronautics and Astronautics.
	Professional association, with somewhere about 30,000-40,000
	members. 65 local chapters around the country -- largest chapters
	are DC area (3000 members), LA (2100 members), San Francisco (2000
	members), Seattle/NW (1500), Houston (1200) and Orange County
	(1200), plus student chapters. Not a union, but acts to represent
	aviation and space professionals (engineers, managers, financial
	types) nationwide. Holds over 30 conferences a year on space and
	aviation

**Similar document 3:**


There is an interesting opinion piece in the business section of today's
LA Times (Thursday April 15, 1993, p. D1).  I thought I'd post it to
stir up some flame wars - I mean reasoned debate.  Let me preface it by
saying that I largely agree that the "Space Age" in the romantic sense
of several decades ago is over, and that projects like the space station
miss the point at this time.  Reading, for example, "What's New" -
the weekly physics update we get here on the net - it's clear that the
romance of the day lies in the ever more fine-grained manipulation of
matter: by which I include biotechnology, condensed matter physics (with
its spinoffs in computer hardware and elsewhere), and the amazing things
people are doing with individual atoms these days.  To a large extent, I
think, the romance some people still have with space is a matter of
nostalgia.  I feel sure that someday we - or more precisely, our "mind
children" - will spread across space (unless we wipe ourselves out); but
I t

**Similar document 4:**


COMMERCIAL SPACE NEWS/SPACE TECHNOLOGY INVESTOR NUMBER 22

   This is number twenty-two in an irregular series on commercial 
space activities.  The commentaries included are my thoughts on 
these developments.  

   Sigh... as usual, I've gotten behind in getting this column 
written.  I can only plead the exigency of the current dynamics in 
the space biz.  This column is put together at lunch hour and after 
the house quiets down at night, so data can quickly build up if 
there's a lot of other stuff going on.  I've complied a lot of 
information and happenings since the last column, so I'm going to 
have to work to keep this one down to a readable length.  Have fun! 

CONTENTS:
1- US COMMERCIAL SPACE SALES FLATTEN IN 1993
2- DELTA WINS TWO KEY LAUNCH CONTRACTS
3- COMMERCIAL REMOTE SENSING VENTURE GETS DOC "GO-AHEAD"
4- INVESTMENT FIRM CALLS GD'S SPACE BIZ "STILL A GOOD INVESTMENT" 
5- ARIANE PREDICTS DIP IN LAUNCH DEMAND
6- NTSB INVESTIGATES PEGASUS LAUNCH OVER ABORTED ABORT
7- ANO

**Similar document 5:**


Archive-name: space/controversy
Last-modified: $Date: 93/04/01 14:39:06 $

CONTROVERSIAL QUESTIONS

    These issues periodically come up with much argument and few facts being
    offered. The summaries below attempt to represent the position on which
    much of the net community has settled. Please DON'T bring them up again
    unless there's something truly new to be discussed. The net can't set
    public policy, that's what your representatives are for.


    WHAT HAPPENED TO THE SATURN V PLANS

    Despite a widespread belief to the contrary, the Saturn V blueprints
    have not been lost. They are kept at Marshall Space Flight Center on
    microfilm.

    The problem in re-creating the Saturn V is not finding the drawings, it
    is finding vendors who can supply mid-1960's vintage hardware (like
    guidance system components), and the fact that the launch pads and VAB
    have been converted to Space Shuttle use, so you have no place to launch
    from.

    By the time you 

In every case the retrieved texts share content with the query document. One case stands out: for the medical document (id 1300), the similar documents do not discuss the same topic, but all of them end with the same person's signature and quote, which is what drives the similarity.
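To check whether a shared signature, rather than the topic, explains a high similarity, one option is to break the cosine similarity down term by term. The sketch below is illustrative only: the vocabulary and TF-IDF weights are hypothetical and not taken from the dataset. For L2-normalized vectors the cosine similarity is the sum of the element-wise products, so the largest products identify the terms that contribute most.

```python
import numpy as np

# Hypothetical toy vocabulary and TF-IDF weights (NOT taken from the dataset).
# The last two terms stand in for words that appear in a shared signature.
vocab = ["patient", "dose", "knee", "surgery", "regards", "motto"]
doc_a = np.array([0.5, 0.4, 0.0, 0.0, 0.6, 0.7])
doc_b = np.array([0.0, 0.0, 0.5, 0.4, 0.6, 0.7])


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)


a, b = l2_normalize(doc_a), l2_normalize(doc_b)

# Per-term contribution to the cosine similarity of the two documents.
contrib = a * b
cosine = float(contrib.sum())

# Terms sorted by decreasing contribution, keeping only shared terms.
order = np.argsort(contrib)[::-1]
top_terms = [(vocab[i], round(float(contrib[i]), 3)) for i in order if contrib[i] > 0]

print(f"cosine similarity: {cosine:.3f}")
print(top_terms)
```

In this toy case the only shared terms are the two signature words, so they account for the whole similarity even though the bodies of the documents deal with different subjects.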

### 2) Naïve Bayes classification model

Train Naïve Bayes classification models to maximize classification performance (macro F1-score) on the test set. Consider changing the instantiation parameters of the vectorizer and the models, and try both the Multinomial Naïve Bayes and ComplementNB models.

### Multinomial Naïve Bayes

In [None]:
# instantiating and training a Naïve Bayes classifier with sklearn is straightforward
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [None]:
# with the vectorizer already fitted on train, vectorize the texts
# of the test set
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred = clf.predict(X_test)

In [None]:
# the F1-score is a suitable metric for reporting classifier performance because it
# is robust to class imbalance. The 'macro' average is the unweighted mean of the
# per-class F1-scores. The 'micro' average is equivalent to accuracy in this
# single-label setting, and accuracy is not a good metric for imbalanced datasets
f1_score(y_test, y_pred, average='macro')

0.5854345727938506
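The difference between the macro and micro averages can be illustrated with a small pure-Python sketch. The labels below are a hypothetical imbalanced example (nine samples of class 0, one of class 1, and a classifier that always predicts class 0); the helper `f1_scores` is written only for this illustration and is not part of sklearn.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1, macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    per_class = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class[c] = 2 * tp[c] / denom if denom else 0.0

    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)

    # Micro: pool the counts of all classes before computing F1.
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return per_class, macro, micro


# Hypothetical imbalanced labels: the classifier never predicts class 1.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

per_class, macro, micro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class, macro, micro, accuracy)
```

Here the micro average coincides with the accuracy (0.9), while the macro average drops to about 0.47 because the minority class, which is never predicted, contributes an F1 of 0.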

Next, I run a hyperparameter search, specifically over the value of alpha.
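As context for what alpha controls: in Multinomial Naïve Bayes, alpha is the additive (Lidstone) smoothing term applied to the per-class feature totals, so each term probability is estimated as (N_ci + alpha) / (N_c + alpha * n), where n is the number of features. The sketch below applies that formula to hypothetical counts; with TF-IDF features the "counts" would be summed TF-IDF weights rather than integers. The function name `smoothed_term_probs` is an assumption made for this illustration.

```python
import numpy as np


def smoothed_term_probs(term_counts, alpha):
    """Additive smoothing: (N_ci + alpha) / (N_c + alpha * n_terms)."""
    term_counts = np.asarray(term_counts, dtype=float)
    n_terms = term_counts.size
    return (term_counts + alpha) / (term_counts.sum() + alpha * n_terms)


# Hypothetical per-class totals for a 4-term vocabulary; two terms are unseen.
counts = [8, 2, 0, 0]

for alpha in (1.0, 0.006):
    probs = smoothed_term_probs(counts, alpha)
    print(f"alpha={alpha}: {np.round(probs, 4)}")
```

A large alpha moves probability mass from observed terms to unseen ones, while a small alpha stays closer to the raw frequencies. With a large and sparse vocabulary such as this one, that may explain why a much smaller alpha than the default of 1 scores better.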

In [None]:
multinomialNB = MultinomialNB()

grid = GridSearchCV(multinomialNB,
                    {"alpha": np.arange(0.001, 1, 0.005), "force_alpha": [False]},
                    refit=True,
                    cv=5,
                    scoring='f1_macro',
                    n_jobs=-1
)

grid.fit(X_train, y_train)

In [None]:
grid.best_params_

{'alpha': 0.006, 'force_alpha': False}

In [None]:
multinomialNB_best = grid.best_estimator_

In [None]:
y_pred = multinomialNB_best.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.6822932391972965

### Complement Naïve Bayes

In [None]:
clfComplNB = ComplementNB()
clfComplNB.fit(X_train, y_train)

In [None]:
y_pred = clfComplNB.predict(X_test)

In [None]:
f1_score(y_test, y_pred, average='macro')

0.692953349950875

I search for the best value of alpha.

In [None]:
complementNB = ComplementNB()

grid = GridSearchCV(complementNB,
                    {"alpha": np.arange(0.001, 1, 0.05), "force_alpha": [False]},
                    refit=True,
                    cv=5,
                    scoring='f1_macro',
                    n_jobs=-1
)

grid.fit(X_train, y_train)

In [None]:
grid.best_params_

{'alpha': 0.15100000000000002, 'force_alpha': False}

In [None]:
complementNB_best = grid.best_estimator_

In [None]:
y_pred = complementNB_best.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.6983191357775227

In [None]:
# Build a table with the results (F1 macro on the test set, rounded to 3 decimals)
df = pd.DataFrame(
    [[1, 0.585],[0.006, 0.682],[1, 0.693],[0.151, 0.698]],
    columns=['alpha','F1 macro'],
    index=['multinomialNB','multinomialNB_best','complementNB','complementNB_best'])

display(df)

| | alpha | F1 macro |
|---|---|---|
| multinomialNB | 1.0 | 0.585 |
| multinomialNB_best | 0.006 | 0.682 |
| complementNB | 1.0 | 0.693 |
| complementNB_best | 0.151 | 0.698 |
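The searches above tune only alpha, with the vectorizer fixed. One possible way to also explore the vectorizer parameters, as the assignment suggests, is to wrap the vectorizer and the classifier in a `Pipeline` and search over both at once. The sketch below uses a hypothetical toy corpus in place of `newsgroups_train.data` and `newsgroups_train.target`, and the parameter grid is only an example, not the configuration used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for newsgroups_train.data / .target.
texts = [
    "the shuttle reached orbit", "nasa launched a new probe",
    "orbit insertion burn went well", "the probe left earth orbit",
    "gun control laws and crime", "handgun sales and crime rates",
    "firearms laws were debated", "crime statistics and gun laws",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together; the vectorizer is refitted
# on the training part of every fold, so no vocabulary leaks from validation.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ComplementNB()),
])

# Example grid (an assumption, chosen only to show the "step__param" syntax).
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "clf__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

best_params = search.best_params_
print(best_params, search.best_score_)
```

On the real dataset the same pattern would take the raw training texts as input, and parameters such as `min_df`, `max_df` or `stop_words` could be added to the grid at the cost of a longer search.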


### 3) Term vectorization

Transpose the document-term matrix. This yields a term-document matrix that can be interpreted as a collection of word vectors. Then study similarity between words by taking 5 words and examining the 5 most similar to each. The words must not be chosen at random, to avoid terms that are hard to interpret; choose them "manually".

In [None]:
# Get the term-document matrix
X_train_td = X_train.T


In [None]:
# Select the words to analyse
palabras = ['medicine','technology','gun','space','electronic']
resultados = []
indices = []

# Get the most similar words
for palabra in palabras:

    idx = tfidfvect.vocabulary_[palabra]
    cossim = cosine_similarity(X_train_td[idx], X_train_td)[0]

    # Sort by descending similarity; position 0 is the word itself, so skip it
    simIdx = np.argsort(cossim)[::-1][1:6]
    cossimVal = cossim[simIdx]

    simPalabras = [f'{idx2word[i]} ({round(value,2)})' for i, value in zip(simIdx, cossimVal)]

    indices.append(palabra)
    resultados.append(simPalabras)

# Build a table with the results
df = pd.DataFrame(
    resultados,
    columns=['Similar term 1','Similar term 2','Similar term 3','Similar term 4','Similar term 5'],
    index=indices)

display(df)

| | Similar term 1 | Similar term 2 | Similar term 3 | Similar term 4 | Similar term 5 |
|---|---|---|---|---|---|
| medicine | strengthens (0.37) | dislikes (0.35) | nearer (0.3) | foremost (0.28) | neurodermitis (0.28) |
| technology | blatently (0.31) | lecturing (0.31) | ellul (0.31) | christan (0.31) | toffler (0.31) |
| gun | guns (0.36) | crime (0.24) | handgun (0.24) | homicides (0.23) | firearms (0.23) |
| space | nasa (0.33) | seds (0.3) | shuttle (0.29) | enfant (0.28) | seti (0.25) |
| electronic | towwang (0.23) | wichever (0.23) | caen (0.22) | cruptology (0.22) | fluourescent (0.21) |


Transposing the document-term matrix gives a vector representation of each term, and the cosine similarity between those vectors retrieves terms that are conceptually close. The term “gun” is a good example: its nearest terms relate to weapons (guns, handgun and firearms) and to the consequences of their use (crime and homicides). The neighbours of “space” (nasa, shuttle, seti) are similarly interpretable. For “medicine”, “technology” and “electronic” the results are weaker: several neighbours are rare or misspelled tokens (blatently, cruptology, wichever), which likely co-occur with the query word in only a handful of documents.
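The idea of reading the columns of the document-term matrix as word vectors can be sketched on a small dense matrix with NumPy alone. The matrix and vocabulary below are hypothetical, and `most_similar_terms` is a helper written for this illustration. Unlike slicing the sorted result with `[1:6]`, it excludes the query term explicitly, which may be safer when another term has an identical vector and ties for first place.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are terms.
vocab = ["gun", "guns", "crime", "nasa", "shuttle"]
X = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1],
    [0, 0, 1, 1, 2],
], dtype=float)


def most_similar_terms(X_doc_term, vocab, word, k=2):
    """Return the k terms whose document profiles are closest to `word`."""
    T = X_doc_term.T  # term-document matrix: one row per term
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    T_unit = T / np.where(norms == 0, 1.0, norms)  # avoid division by zero

    idx = vocab.index(word)
    sims = T_unit @ T_unit[idx]  # cosine similarity to every term
    sims[idx] = -np.inf          # exclude the query term explicitly

    top = np.argsort(sims)[::-1][:k]
    return [(vocab[i], round(float(sims[i]), 2)) for i in top]


neighbours = most_similar_terms(X, vocab, "gun")
print(neighbours)
```

Two terms end up close when they appear in the same documents with similar weights, which is why terms that occur in very few documents can reach high similarity with each other by chance.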