In [1]:
string= ''' 
1. Title: CNAE-9

2. Source Information
- Data set was initially used in: 
  Patrick Marques Ciarelli, Elias Oliveira, "Agglomeration and Elimination of Terms for Dimensionality Reduction",
  Ninth International Conference on Intelligent Systems Design and Applications, pp. 547-552, 2009

3. Past Usage:
- Patrick Marques Ciarelli, Elias Oliveira, "Agglomeration and Elimination of Terms for Dimensionality Reduction", 
  Ninth International Conference on Intelligent Systems Design and Applications, pp. 547-552, 2009:
  - Feature selection (900 instances for training and 180 instances for test): 
    - Best results using kNN (k=1):
      50 dimensions: 87.78% (LSI)
      100 dimensions: 92.78% (LSI)
      150 dimensions: 92.22% (LSI)
      200 dimensions: 92.78% (MI)
      250 dimensions: 92.78% (MI)

- Patrick Marques Ciarelli, Elias Oliveira, Evandro O. T. Salles, "An Evolving System Based on Probabilistic Neural Network",
  Brazilian Symposium on Artificial Neural Network, 2010:
  - Incremental learning (no off-line training):
    - Best result:
      88.71% (ePNN)

4. Relevant Information:

- This is a data set containing 1080 documents of free text business descriptions of Brazilian companies categorized into a
subset of 9 categories cataloged in a table called National Classification of Economic Activities (Classificação Nacional de
Atividade Econômicas - CNAE). The original texts were pre-processed to obtain the current data set: initially, it was kept only
letters and then it was removed prepositions of the texts. Next, the words were transformed to their canonical form. Finally,
each document was represented as a vector, where the weight of each word is its frequency in the document. This data set is
highly sparse (99.22% of the matrix is filled with zeros).
 
5. Number of Instances: 1080

6. Number of Attributes: 857 (1 category, 856 word frequency)

7. Attribute Information:
   1. category: range 1 - 9 (integer)
   2 - 857. word frequency: (integer)

8. Missing Attribute Values: None

9. Class Distribution: the categories are equally distribuited. (120 instances in each of nine categories)

Summary Statistics:
                 Min   Max   Mean    SD
word frequency:   0     4   0.0082 0.0948
'''

In [3]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each line of the description as one "document"
dataset = string.split('\n')

In [4]:
dataset

[' ',
 '1. Title: CNAE-9',
 '',
 '2. Source Information',
 '- Data set was initially used in: ',
 '  Patrick Marques Ciarelli, Elias Oliveira, "Agglomeration and Elimination of Terms for Dimensionality Reduction",',
 '  Ninth International Conference on Intelligent Systems Design and Applications, pp. 547-552, 2009',
 '',
 '3. Past Usage:',
 '- Patrick Marques Ciarelli, Elias Oliveira, "Agglomeration and Elimination of Terms for Dimensionality Reduction", ',
 '  Ninth International Conference on Intelligent Systems Design and Applications, pp. 547-552, 2009:',
 '  - Feature selection (900 instances for training and 180 instances for test): ',
 '    - Best results using kNN (k=1):',
 '      50 dimensions: 87.78% (LSI)',
 '      100 dimensions: 92.78% (LSI)',
 '      150 dimensions: 92.22% (LSI)',
 '      200 dimensions: 92.78% (MI)',
 '      250 dimensions: 92.78% (MI)',
 '',
 '- Patrick Marques Ciarelli, Elias Oliveira, Evandro O. T. Salles, "An Evolving System Based on Probabilistic N

In [5]:
print("Extracting tf-idf features...")
# First, create an unfitted vectorizer that extracts unigrams and bigrams.
# Other options worth trying: max_df=0.95, min_df=2, stop_words='english'
# (see help(TfidfVectorizer) for what each one does).
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
t0 = time()
# Next, learn the vocabulary and idf weights, then transform the data
tfidf = tfidf_vectorizer.fit_transform(dataset)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features...
done in 0.006s.
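
The `min_df` and `max_df` options mentioned in the cell above prune the vocabulary by document frequency. The following is a minimal pure-Python sketch of that filtering, assuming the documented scikit-learn semantics (an integer is an absolute document count, a float is a proportion of documents). The toy corpus and the helper name `prune_vocabulary` are hypothetical and not part of the notebook.

```python
import re
from collections import Counter

def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within [min_df, max_df]."""
    token_re = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once
    df = Counter()
    for doc in docs:
        df.update(set(token_re.findall(doc.lower())))
    # Floats are proportions of the corpus, integers are absolute counts
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, count in df.items() if low <= count <= high)

# Hypothetical toy corpus, for illustration only
toy_docs = ["word frequency integer", "word frequency", "word category"]
kept = prune_vocabulary(toy_docs, min_df=2, max_df=0.95)
print(kept)
```

In this toy corpus 'word' occurs in every document and is dropped by `max_df=0.95`, while 'integer' and 'category' occur once each and are dropped by `min_df=2`.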


In [6]:
tfidf.data

array([ 0.59572018,  0.53873456,  0.59572018,  0.60861628,  0.50908982,
        0.60861628,  0.26289057,  0.26289057,  0.26289057,  0.30331067,
        0.33539391,  0.26289057,  0.26289057,  0.33539391,  0.33539391,
        0.33539391,  0.33539391,  0.19456866,  0.19456866,  0.19456866,
        0.19456866,  0.19456866,  0.21035583,  0.16385854,  0.21035583,
        0.13905488,  0.21035583,  0.19456866,  0.21035583,  0.21035583,
        0.19456866,  0.19456866,  0.19456866,  0.19456866,  0.21035583,
        0.21035583,  0.21035583,  0.21035583,  0.21035583,  0.21035583,
        0.21035583,  0.21035583,  0.15783148,  0.2026185 ,  0.2026185 ,
        0.2026185 ,  0.17561695,  0.2026185 ,  0.2026185 ,  0.2026185 ,
        0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,
        0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,
        0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,  0.2026185 ,
        0.2026185 ,  0.2026185 ,  0.57735027,  0.57735027,  0.57
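
The first three values in `tfidf.data` can be reproduced by hand, which is a useful check on what the weights mean. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, raw term counts) and counts read off the text above: 50 lines in `dataset`, with 'title' and 'title cnae' occurring in one line each and 'cnae' in two. The helper `tfidf_row` is illustrative and not part of the notebook.

```python
import math

def tfidf_row(term_counts, doc_freq, n_docs):
    """L2-normalised tf-idf weights for one document, using smoothed idf."""
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

# Line '1. Title: CNAE-9': single-character tokens ('1', '9') are dropped
counts = {"title": 1, "title cnae": 1, "cnae": 1}
# Assumed document frequencies: 'cnae' also appears once in section 4
doc_freq = {"title": 1, "title cnae": 1, "cnae": 2}
weights = tfidf_row(counts, doc_freq, n_docs=50)
for term, w in weights.items():
    print(f"{term!r}: {w:.5f}")
```

Under these assumptions 'title' and 'title cnae' come out near 0.59572 and 'cnae' near 0.53873, in line with the first entries printed above: the rarer a term is across lines, the larger its share of the row's unit length.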

In [7]:
# Convert the sparse matrix to a dense one for inspection
dense = tfidf.todense()
print(dense)

[[ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.41070443  0.41070443  0.41070443 ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


In [8]:
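# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0
# use get_feature_names() instead of get_feature_names_out()
feature_names = list(tfidf_vectorizer.get_feature_names_out())
print(len(feature_names))
feature_names[:3]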

388


['0082', '0082 0948', '0948']
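
These first feature names come from the summary-statistics line at the end of the description. A rough pure-Python sketch of how `ngram_range=(1, 2)` turns one line into features, assuming the vectorizer's documented default tokenization (lowercased runs of two or more word characters); the helper `word_ngrams` is a hypothetical name:

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all n-grams of consecutive tokens for each n in ngram_range."""
    # Single-character tokens such as '0' and '4' do not match and are dropped
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

line = "word frequency:   0     4   0.0082 0.0948"
features = word_ngrams(line)
print(sorted(features))
```

Because '0.0082' is split at the decimal point and the lone '0' is discarded, the line contributes '0082', '0948', and the bigram '0082 0948', which sort ahead of every alphabetic term.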

In [9]:
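import pandas as pd

# One row per line of text, one column per unigram/bigram
x = pd.DataFrame(dense, columns=list(tfidf_vectorizer.get_feature_names_out()))
# Caution: 'text' is itself a vocabulary term, so this assignment
# overwrites that tf-idf column instead of appending a new one
x['text'] = dataset
x.to_csv('mytfidf.csv', index=False)
x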

Unnamed: 0,0082,0082 0948,0948,100,100 dimensions,1080,1080 documents,120,120 instances,150,...,where,where the,with,with zeros,word,word frequency,word is,words,words were,zeros
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
