Build Word Cloud #18

Closed
nickescobedo opened this issue Aug 15, 2017 · 3 comments

@nickescobedo

Saw this package and noticed on the wiki page it mentions building a word cloud, but the page is empty. https://github.com/yooper/php-text-analysis/wiki/PHP-Keyword-Phrases-Word-Cloud

How could I potentially go about building a word cloud with this package?

Thanks!

@yooper
Owner

yooper commented Aug 18, 2017

Hello, sorry for the delayed response. To render the word cloud you will need the D3 word cloud library: https://github.com/jasondavies/d3-cloud

As for the PHP code, here is what I have done in the past ...

use TextAnalysis\Analysis\Keywords\Rake;
use TextAnalysis\Documents\TokensDocument;
use TextAnalysis\Tokenizers\WhitespaceTokenizer;
use StopWordFactory;
use TextAnalysis\Filters;

class WordCloud
{
    const NGRAM_SIZE = 3;
    
    /**
     * @var \TextAnalysis\Interfaces\ITokenTransformation[]
     */
    protected $tokenFilters = [];
    
    /**
     * @var \TextAnalysis\Interfaces\ITokenTransformation[]
     */    
    protected $contentFilters = [];    

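    /**
     * Rescales the keyword scores so that they sum to 1, because the raw
     * scores are not set up in a way that is compatible with what D3 cloud expects
     * @param array $keywordScores
     * @return array
     */
    public function getScaledScores($keywordScores)
    {
        $total = array_sum($keywordScores);
        // nothing to scale (e.g. getKeywordScores returned []); avoids dividing by zero
        if($total == 0) {
            return $keywordScores;
        }
        $scaleFactor = 1 / $total;
        
        array_walk($keywordScores, 
            function(&$value, $key) use ($scaleFactor){                 
                $value = round($value * $scaleFactor, 5);
            });            
        return $keywordScores;
    }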
    
    /**
     * 
     * @return \TextAnalysis\Interfaces\ITokenTransformation[]
     */
    public function getContentFilters()
    {
        if(empty($this->contentFilters)) {
            
            $lambdaFunc = function($word){
                return  preg_replace('/[^[:print:]]/', ' ', $word);
            };
            
            $this->contentFilters = [
                new Filters\StripTagsFilter(),
                new Filters\LowerCaseFilter(),
                new Filters\NumbersFilter(),           
                new Filters\EmailFilter(),
                new Filters\UrlFilter(),
                new Filters\PossessiveNounFilter(),
                new Filters\QuotesFilter(),
                new Filters\PunctuationFilter(),
                new Filters\CharFilter(),
                new Filters\LambdaFilter($lambdaFunc),
                new Filters\WhitespaceFilter()     
            ];
        }
        return $this->contentFilters;
    }
    
    /**
     * 
     * @return \TextAnalysis\Interfaces\ITokenTransformation[]
     */
    public function getTokenFilters()
    {
        if(empty($this->tokenFilters)) {
            $stopwords = StopWordFactory::get('stop-words-fox.txt');
            $this->tokenFilters = [              
                new Filters\StopWordsFilter($stopwords),
            ];
        }        
        return $this->tokenFilters;
    }
    
    /**
     * 
     * @param string $content
     * @return array
     */
    public function getKeywordScores($content)
    {        
        $tokens = (new WhitespaceTokenizer())->tokenize($content);       
        $tokenDoc = new TokensDocument(array_map('strval', $tokens));
        unset($tokens);
                
        foreach($this->getTokenFilters() as $filter)
        {
            $tokenDoc->applyTransformation($filter, false);
        }        
        
        // filtered-out tokens are left as null entries in the array,
        // so make sure enough tokens remain and at least one is non-empty
          
        $size = count($tokenDoc->toArray());
        if($size < self::NGRAM_SIZE || !array_filter($tokenDoc->toArray())) {
            return [];
        }           
        
        $rake = new Rake($tokenDoc, self::NGRAM_SIZE);
        return $rake->getKeywordScores();
    }

}

$cloud = new WordCloud();
$scores = $cloud->getKeywordScores("YOUR CONTENT GOES HERE");
// scales the scores for the D3 cloud library
$scaledScores = $cloud->getScaledScores($scores);

Pass $scaledScores to the D3 cloud library to render the cloud. Sorry for the incomplete example. Please post your completed solution and I will use it to update the documentation.
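
For anyone picking this up, here is a rough sketch of the missing hand-off step: turning the scaled scores into a JSON array that front-end code could feed to d3-cloud. The function name toCloudWords, the text/size field names, and the 10 to 90 pixel range are assumptions made for illustration (they follow the shape used in the d3-cloud README example), not part of this package.

```php
<?php
// Hypothetical helper (not part of php-text-analysis): maps phrase => score
// pairs onto a list of ['text' => ..., 'size' => ...] entries.
function toCloudWords(array $scaledScores, int $minSize = 10, int $maxSize = 90): array
{
    if (empty($scaledScores)) {
        return [];
    }

    // The highest score gets $maxSize; everything else is scaled relative to it.
    $max = max($scaledScores);
    $words = [];
    foreach ($scaledScores as $phrase => $score) {
        $ratio = $max > 0 ? $score / $max : 0;
        $words[] = [
            'text' => (string) $phrase,
            'size' => (int) round($minSize + $ratio * ($maxSize - $minSize)),
        ];
    }
    return $words;
}

// Example input shaped like the output of getScaledScores().
$words = toCloudWords(['word cloud' => 0.6, 'php' => 0.3, 'text' => 0.1]);
$json = json_encode($words);
```

The $json string can then be embedded in the page or returned from an endpoint, and the JavaScript side would need to read d.text and d.size (for example via the layout's fontSize accessor).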

@nickescobedo
Author

No problem, thank you for this! I'll report back after I try this.

I did get a working prototype with jQCloud and the getKeyValuesByWeight from the FreqDist class.
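
A plain-PHP sketch of the frequency-based idea, for readers without the prototype in front of them. It assumes the weight of a token is its count divided by the total number of tokens; whether FreqDist::getKeyValuesByWeight computes exactly this has not been checked here, and frequencyWeights is a made-up name.

```php
<?php
// Hypothetical stand-in for a frequency-weight calculation:
// weight = occurrences of the token / total number of tokens.
function frequencyWeights(string $content): array
{
    // Lowercase and split on whitespace, dropping empty pieces.
    $tokens = preg_split('/\s+/', strtolower(trim($content)), -1, PREG_SPLIT_NO_EMPTY);
    if (empty($tokens)) {
        return [];
    }

    $counts = array_count_values($tokens);
    arsort($counts); // most frequent tokens first

    $total = count($tokens);
    $weights = [];
    foreach ($counts as $token => $count) {
        $weights[$token] = $count / $total;
    }
    return $weights;
}

$weights = frequencyWeights("a b a c");
```

Each token/weight pair would then be mapped to whatever object shape the chosen cloud plugin expects.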

@yooper
Owner

yooper commented Aug 21, 2017

Sounds good. I am closing this issue.

@yooper yooper closed this as completed Aug 21, 2017